perf: optimize SIMD loops with unsafe indexing #825

Draft
d-enk wants to merge 1 commit into samber:master from d-enk:perf-simd-no-bounds-check

d-enk commented Mar 2, 2026

Replace slice-based access (base[i:i+lanes]) with unsafeIndexVec to eliminate runtime.panicBounds checks in critical SIMD loops.

Changes:

  • Replace unsafeSliceInt8/16/32/64 and unsafeSliceUint* with unsafeIndexVec
  • Add unsafeIndexBase and unsafeIndexOffset for optimal performance
  • Update Sum*, Mean*, Min*, Max*, Contains*, Clamp* functions
  • Remove unused unsafeSlice* helper functions

Generated code size: ~96 bytes (~12 instructions) smaller per function.
Dropping the intermediate slice header should also help cache locality.
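
For readers unfamiliar with the pattern, here is a minimal sketch of the difference between slice-based chunk access and pointer-based indexing. It is not the PR's code: `chunkAt`, `sumSliced`, `sumUnsafe`, and the fixed `lanes = 4` are hypothetical names chosen for illustration, and the real `unsafeIndexVec`, `unsafeIndexBase`, and `unsafeIndexOffset` helpers may look quite different. It assumes Go 1.20+ for `unsafe.SliceData`.

```go
package main

import (
	"fmt"
	"unsafe"
)

const lanes = 4 // illustrative vector width (4 x int32 = 128 bits)

// sumSliced reads each chunk through a slice expression. Unless the compiler
// can prove the range is valid, base[i:i+lanes] carries a bounds check.
func sumSliced(base []int32) int64 {
	var total int64
	i := 0
	for ; i+lanes <= len(base); i += lanes {
		chunk := base[i : i+lanes]
		total += int64(chunk[0]) + int64(chunk[1]) + int64(chunk[2]) + int64(chunk[3])
	}
	for ; i < len(base); i++ { // scalar tail
		total += int64(base[i])
	}
	return total
}

// chunkAt is a hypothetical helper, not the PR's unsafeIndexVec. It views the
// lanes elements starting at index i as a fixed-size array with no bounds
// check. The caller must guarantee that i+lanes <= len of the backing slice.
func chunkAt(p *int32, i int) *[lanes]int32 {
	return (*[lanes]int32)(unsafe.Add(unsafe.Pointer(p), i*int(unsafe.Sizeof(int32(0)))))
}

// sumUnsafe performs the same walk, but addresses chunks from the data pointer.
func sumUnsafe(base []int32) int64 {
	var total int64
	p := unsafe.SliceData(base) // never dereferenced when the slice is too short
	i := 0
	for ; i+lanes <= len(base); i += lanes {
		c := chunkAt(p, i)
		total += int64(c[0]) + int64(c[1]) + int64(c[2]) + int64(c[3])
	}
	for ; i < len(base); i++ { // scalar tail
		total += int64(base[i])
	}
	return total
}

func main() {
	data := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(sumSliced(data), sumUnsafe(data)) // both sums should match
}
```

The loop condition is the only thing keeping the unsafe version in bounds, so the safety argument moves from the compiler to the caller. That is the trade-off this kind of change makes.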

d-enk commented Mar 2, 2026

                                  │   old.txt    │              new.txt               │
                                  │    sec/op    │   sec/op     vs base               │
SumInt8/small/AVX-x16-4              11.74n ±  2%   13.22n ±  4%  +12.60% (p=0.000 n=8)
SumInt8/small/AVX2-x32-4             19.36n ±  6%   17.55n ±  4%   -9.35% (p=0.001 n=8)
SumInt8/medium/AVX-x16-4             13.26n ± 21%   12.47n ±  7%   -5.99% (p=0.015 n=8)
SumInt8/medium/AVX2-x32-4            17.45n ±  5%   28.67n ±  2%  +64.28% (p=0.000 n=8)
SumInt8/large/AVX-x16-4              48.85n ±  2%   45.22n ±  3%   -7.44% (p=0.003 n=8)
SumInt8/large/AVX2-x32-4             33.16n ±  3%   26.35n ±  3%  -20.52% (p=0.000 n=8)
SumInt8/xlarge/AVX-x16-4             305.8n ± 12%   256.2n ± 15%  -16.22% (p=0.010 n=8)
SumInt8/xlarge/AVX2-x32-4            281.8n ± 28%   143.2n ± 11%  -49.21% (p=0.000 n=8)
SumInt16/small/AVX-x8-4              7.346n ± 22%   5.850n ±  3%  -20.37% (p=0.000 n=8)
SumInt16/small/AVX2-x16-4            13.84n ±  6%   14.15n ± 15%        ~ (p=0.505 n=8)
SumInt16/medium/AVX-x8-4             15.35n ±  6%   12.58n ± 12%  -17.99% (p=0.000 n=8)
SumInt16/medium/AVX2-x16-4           12.75n ±  4%   11.15n ±  7%  -12.55% (p=0.006 n=8)
SumInt16/large/AVX-x8-4              89.76n ±  5%   72.66n ±  8%  -19.06% (p=0.000 n=8)
SumInt16/large/AVX2-x16-4            58.96n ± 16%   48.37n ±  7%  -17.97% (p=0.000 n=8)
SumInt16/xlarge/AVX-x8-4             612.5n ±  5%   474.8n ± 14%  -22.47% (p=0.000 n=8)
SumInt16/xlarge/AVX2-x16-4           325.7n ± 24%   250.5n ±  5%  -23.10% (p=0.000 n=8)
SumInt32/small/AVX-x4-4              7.254n ±  2%   6.150n ±  2%  -15.22% (p=0.000 n=8)
SumInt32/small/AVX2-x8-4             6.651n ±  7%   6.084n ±  2%   -8.53% (p=0.010 n=8)
SumInt32/medium/AVX-x4-4             30.08n ±  5%   24.48n ±  3%  -18.59% (p=0.000 n=8)
SumInt32/medium/AVX2-x8-4            14.55n ± 10%   12.15n ±  4%  -16.47% (p=0.000 n=8)
SumInt32/large/AVX-x4-4              171.8n ±  5%   124.6n ±  7%  -27.44% (p=0.000 n=8)
SumInt32/large/AVX2-x8-4             95.56n ±  9%   71.80n ±  5%  -24.87% (p=0.000 n=8)
SumInt32/xlarge/AVX-x4-4            1302.5n ± 11%   952.8n ±  6%  -26.85% (p=0.000 n=8)
SumInt32/xlarge/AVX2-x8-4            631.6n ±  9%   463.2n ±  1%  -26.65% (p=0.000 n=8)
SumInt64/small/AVX-x2-4              9.235n ± 14%   7.116n ±  2%  -22.95% (p=0.000 n=8)
SumInt64/small/AVX2-x4-4             6.291n ±  1%   5.868n ±  2%   -6.73% (p=0.001 n=8)
SumInt64/medium/AVX-x2-4             47.29n ± 13%   30.71n ±  2%  -35.07% (p=0.000 n=8)
SumInt64/medium/AVX2-x4-4            23.45n ± 11%   26.09n ±  5%  +11.21% (p=0.015 n=8)
SumInt64/large/AVX-x2-4              319.3n ±  2%   236.9n ±  3%  -25.79% (p=0.000 n=8)
SumInt64/large/AVX2-x4-4             176.2n ±  7%   127.9n ±  3%  -27.46% (p=0.000 n=8)
SumInt64/xlarge/AVX-x2-4             2.726µ ±  8%   1.981µ ±  6%  -27.33% (p=0.000 n=8)
SumInt64/xlarge/AVX2-x4-4            1.682µ ± 19%   1.082µ ±  2%  -35.68% (p=0.000 n=8)
SumFloat32/small/AVX-x4-4            10.42n ± 27%   10.12n ±  4%        ~ (p=0.645 n=8)
SumFloat32/small/AVX2-x8-4          11.475n ± 13%   9.527n ±  7%  -16.97% (p=0.001 n=8)
SumFloat32/medium/AVX-x4-4           42.24n ± 24%   45.67n ±  2%   +8.12% (p=0.002 n=8)
SumFloat32/medium/AVX2-x8-4          20.09n ±  5%   16.02n ±  1%  -20.28% (p=0.000 n=8)
SumFloat32/large/AVX-x4-4            300.3n ±  6%   296.3n ±  5%        ~ (p=0.938 n=8)
SumFloat32/large/AVX2-x8-4           139.3n ±  2%   127.8n ±  3%   -8.26% (p=0.000 n=8)
SumFloat32/xlarge/AVX-x4-4           2.369µ ±  2%   2.354µ ±  1%        ~ (p=0.721 n=8)
SumFloat32/xlarge/AVX2-x8-4          1.241µ ±  5%   1.154µ ±  6%   -7.01% (p=0.007 n=8)
SumFloat64/small/AVX-x2-4            7.584n ± 27%   9.575n ±  4%  +26.25% (p=0.050 n=8)
SumFloat64/small/AVX2-x4-4           7.571n ±  2%   5.891n ±  2%  -22.18% (p=0.000 n=8)
SumFloat64/medium/AVX-x2-4           78.12n ± 28%   78.62n ±  6%        ~ (p=0.574 n=8)
SumFloat64/medium/AVX2-x4-4          31.53n ±  6%   31.18n ±  4%        ~ (p=1.000 n=8)
SumFloat64/large/AVX-x2-4            646.2n ± 44%   595.2n ±  3%   -7.88% (p=0.003 n=8)
SumFloat64/large/AVX2-x4-4           309.2n ±  7%   267.3n ±  1%  -13.54% (p=0.000 n=8)
SumFloat64/xlarge/AVX-x2-4           6.667µ ± 10%   4.859µ ±  7%  -27.12% (p=0.000 n=8)
SumFloat64/xlarge/AVX2-x4-4          3.209µ ±  9%   2.372µ ±  2%  -26.09% (p=0.000 n=8)
MeanInt32/small/AVX-x4-4            13.050n ± 10%   8.471n ±  1%  -35.09% (p=0.000 n=8)
MeanInt32/small/AVX2-x8-4            14.51n ± 13%   10.42n ±  3%  -28.22% (p=0.000 n=8)
MeanInt32/medium/AVX-x4-4            36.90n ± 19%   20.81n ±  1%  -43.61% (p=0.000 n=8)
MeanInt32/medium/AVX2-x8-4           23.25n ± 11%   15.25n ±  3%  -34.42% (p=0.000 n=8)
MeanInt32/large/AVX-x4-4             178.7n ±  5%   133.1n ±  1%  -25.49% (p=0.000 n=8)
MeanInt32/large/AVX2-x8-4           109.90n ± 28%   79.89n ±  5%  -27.30% (p=0.000 n=8)
MeanInt32/xlarge/AVX-x4-4            1.671µ ± 17%   1.010µ ±  8%  -39.55% (p=0.000 n=8)
MeanInt32/xlarge/AVX2-x8-4           742.0n ± 16%   515.4n ±  4%  -30.54% (p=0.000 n=8)
MeanFloat64/small/AVX-x2-4           14.12n ± 20%   10.26n ± 23%  -27.28% (p=0.001 n=8)
MeanFloat64/small/AVX2-x4-4         11.220n ± 26%   9.695n ±  3%  -13.59% (p=0.000 n=8)
MeanFloat64/medium/AVX-x2-4          85.54n ± 25%   65.62n ± 18%  -23.29% (p=0.038 n=8)
MeanFloat64/medium/AVX2-x4-4         34.88n ±  7%   30.19n ±  7%  -13.45% (p=0.001 n=8)
MeanFloat64/large/AVX-x2-4           665.3n ±  3%   555.7n ±  1%  -16.47% (p=0.000 n=8)
MeanFloat64/large/AVX2-x4-4          327.1n ±  7%   283.4n ±  3%  -13.34% (p=0.000 n=8)
MeanFloat64/xlarge/AVX-x2-4          5.418µ ± 23%   4.731µ ±  3%  -12.68% (p=0.003 n=8)
MeanFloat64/xlarge/AVX2-x4-4         2.870µ ± 17%   2.542µ ±  3%  -11.43% (p=0.001 n=8)
MinInt32/small/AVX-x4-4              5.732n ±  3%   4.532n ±  1%  -20.93% (p=0.000 n=8)
MinInt32/small/AVX2-x8-4             5.675n ±  9%   4.627n ±  1%  -18.48% (p=0.000 n=8)
MinInt32/medium/AVX-x4-4             33.31n ±  2%   27.84n ±  4%  -16.42% (p=0.000 n=8)
MinInt32/medium/AVX2-x8-4            19.57n ± 30%   16.53n ±  3%  -15.56% (p=0.000 n=8)
MinInt32/large/AVX-x4-4              254.6n ± 10%   216.3n ±  3%  -15.01% (p=0.000 n=8)
MinInt32/large/AVX2-x8-4             139.2n ± 10%   110.0n ±  1%  -21.01% (p=0.000 n=8)
MinInt32/xlarge/AVX-x4-4             2.245µ ± 13%   1.648µ ±  4%  -26.61% (p=0.000 n=8)
MinInt32/xlarge/AVX2-x8-4           1217.0n ± 12%   847.6n ±  5%  -30.35% (p=0.000 n=8)
MinFloat64/small/AVX-x2-4            11.92n ± 32%   11.66n ±  2%        ~ (p=0.627 n=8)
MinFloat64/small/AVX2-x4-4           8.325n ± 19%   5.962n ±  2%  -28.39% (p=0.000 n=8)
MinFloat64/medium/AVX-x2-4           150.4n ± 24%   120.2n ±  1%  -20.08% (p=0.015 n=8)
MinFloat64/medium/AVX2-x4-4          47.40n ± 14%   35.02n ±  4%  -26.10% (p=0.000 n=8)
MinFloat64/large/AVX-x2-4            931.6n ± 37%   928.8n ±  2%        ~ (p=1.000 n=8)
MinFloat64/large/AVX2-x4-4           384.0n ± 12%   286.9n ±  9%  -25.28% (p=0.000 n=8)
MinFloat64/xlarge/AVX-x2-4           8.061µ ± 15%   7.908µ ±  5%        ~ (p=0.645 n=8)
MinFloat64/xlarge/AVX2-x4-4          3.059µ ±  8%   2.347µ ±  0%  -23.29% (p=0.000 n=8)
MaxInt32/small/AVX-x4-4              5.939n ±  7%   5.189n ±  4%  -12.64% (p=0.000 n=8)
MaxInt32/small/AVX2-x8-4             6.117n ± 11%   4.543n ±  1%  -25.72% (p=0.000 n=8)
MaxInt32/medium/AVX-x4-4             34.40n ±  7%   28.20n ±  4%  -18.02% (p=0.000 n=8)
MaxInt32/medium/AVX2-x8-4            20.04n ±  9%   16.86n ±  4%  -15.87% (p=0.000 n=8)
MaxInt32/large/AVX-x4-4              256.7n ± 17%   208.7n ±  1%  -18.70% (p=0.000 n=8)
MaxInt32/large/AVX2-x8-4             154.8n ± 37%   111.1n ±  3%  -28.21% (p=0.000 n=8)
MaxInt32/xlarge/AVX-x4-4             2.323µ ± 18%   1.769µ ±  7%  -23.85% (p=0.000 n=8)
MaxInt32/xlarge/AVX2-x8-4            969.5n ±  7%   811.1n ±  4%  -16.33% (p=0.000 n=8)
MaxFloat64/small/AVX-x2-4            10.46n ± 33%   12.77n ±  3%  +22.07% (p=0.000 n=8)
MaxFloat64/small/AVX2-x4-4           12.27n ±  3%   12.16n ±  2%        ~ (p=0.978 n=8)
MaxFloat64/medium/AVX-x2-4           130.6n ± 18%   120.8n ±  5%        ~ (p=0.083 n=8)
MaxFloat64/medium/AVX2-x4-4          45.92n ±  7%   38.05n ±  4%  -17.16% (p=0.000 n=8)
MaxFloat64/large/AVX-x2-4           1003.1n ± 21%   926.0n ±  1%        ~ (p=0.130 n=8)
MaxFloat64/large/AVX2-x4-4           325.9n ±  7%   299.4n ±  2%   -8.15% (p=0.000 n=8)
MaxFloat64/xlarge/AVX-x2-4           8.442µ ± 15%   7.522µ ±  1%        ~ (p=0.105 n=8)
MaxFloat64/xlarge/AVX2-x4-4          2.527µ ±  3%   2.490µ ±  5%        ~ (p=0.664 n=8)
SumInt8ByWidth/AVX-x16-4             150.8n ± 10%   126.7n ±  1%  -16.01% (p=0.000 n=8)
SumInt8ByWidth/AVX2-x32-4            97.53n ±  3%   78.44n ±  4%  -19.58% (p=0.000 n=8)
SumInt64SteadyState/AVX-x2-4         3.049µ ±  9%   1.901µ ±  2%  -37.67% (p=0.000 n=8)
SumInt64SteadyState/AVX2-x4-4        1.441µ ±  9%   1.062µ ±  6%  -26.24% (p=0.000 n=8)
geomean                              120.3n         102.7n        -14.60%

d-enk force-pushed the perf-simd-no-bounds-check branch 2 times, most recently from d0b5c52 to c987df8 on March 2, 2026 15:04
d-enk commented Mar 2, 2026

These benchmarks regressed, and I couldn't figure out why:

SumInt8/medium/AVX2-x32 (+64.28%)
SumInt64/medium/AVX2-x4 (+11.21%)
SumFloat32/medium/AVX-x4 (+8.12%)

Possibly some tricky caching effect.

d-enk force-pushed the perf-simd-no-bounds-check branch from c987df8 to babf7ad on March 2, 2026 15:12
codecov Bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.48%. Comparing base (14fff24) to head (babf7ad).
⚠️ Report is 44 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #825      +/-   ##
==========================================
+ Coverage   92.11%   92.48%   +0.37%     
==========================================
  Files          32       32              
  Lines        4259     5019     +760     
==========================================
+ Hits         3923     4642     +719     
- Misses        252      274      +22     
- Partials       84      103      +19     
Flag Coverage Δ
unittests 92.48% <ø> (+0.37%) ⬆️

d-enk marked this pull request as draft on March 13, 2026 17:04