Environment
Before
After
Speedup
Optimization methods
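Changed the traversal to YX order, and switched to _mm256_stream_ps so that results are written straight to memory with non-temporal stores.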
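A minimal sketch of what YX-order traversal combined with _mm256_stream_ps can look like. The function name `fill_yx_stream`, the layout (element (x, y) stored at `data[y * nx + x]`, so x is the contiguous index), and the constant fill value are assumptions made for illustration, not the code from this PR. It also assumes an x86 CPU with AVX, a 32-byte-aligned buffer, and `nx` divisible by 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: fills an nx-by-ny float matrix with one value.
// Assumed layout: element (x, y) lives at data[y * nx + x].
// Assumed preconditions: data is 32-byte aligned, nx is a multiple of 8.
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("avx")))  // enable AVX for this function only
#endif
void fill_yx_stream(float *data, std::size_t nx, std::size_t ny, float value) {
    assert(reinterpret_cast<std::uintptr_t>(data) % 32 == 0);
    assert(nx % 8 == 0);
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t y = 0; y < ny; ++y) {         // outer loop over y
        for (std::size_t x = 0; x < nx; x += 8) {  // inner loop over contiguous x
            // Non-temporal store: writes 8 floats without first loading
            // the destination cache line into the cache.
            _mm256_stream_ps(data + y * nx + x, v);
        }
    }
    _mm_sfence();  // order the streaming stores before later memory accesses
}
```

Keeping x in the inner loop makes consecutive stores touch consecutive addresses, which is what lets the streaming stores fill whole cache lines.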
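Optimized with loop tiling plus Morton ordering. The tile size is chosen so that BLOCKSIZE ^ 2 is smaller than the L1 cache size, and TBB's tbb::simple_partitioner is used to traverse the tiles in Morton order.
Same optimization as matrix_transpose, but with an extra temporary variable that holds the running sum; the result is written to out(x, y) only at the end.
The two temporary variables are made function-local statics, i.e. pooled by hand, so that memory is not allocated again on every call. P.S. Although no multithreading is used here, they are still declared thread_local, as taught in the course, to keep them thread-safe.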
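The loop tiling, Morton ordering, and running-sum ideas described above can be sketched in serial form as follows. This is an illustration under stated assumptions, not the PR's code: it decodes Morton codes explicitly instead of relying on tbb::simple_partitioner, and the names `compact_bits`, `for_each_tile_morton`, `transpose_morton`, and `multiply_morton`, the value of `BLOCKSIZE`, and the layout (element (x, y) at index `y * n + x`) are all hypothetical. It assumes square n-by-n matrices where `n / BLOCKSIZE` is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t BLOCKSIZE = 32;  // assumed: 32 * 32 floats = 4 KiB per tile

// Extract the bits at even positions (0, 2, 4, ...) and pack them together.
static std::uint32_t compact_bits(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// Visit every BLOCKSIZE x BLOCKSIZE tile of an n x n matrix in Morton (Z) order.
// body receives the tile's starting x and y coordinates.
template <class F>
void for_each_tile_morton(std::size_t n, F &&body) {
    assert(n % BLOCKSIZE == 0);
    const std::size_t tiles = n / BLOCKSIZE;
    assert((tiles & (tiles - 1)) == 0);  // power of two, so codes cover all tiles
    for (std::uint32_t code = 0; code < tiles * tiles; ++code) {
        body(compact_bits(code) * BLOCKSIZE, compact_bits(code >> 1) * BLOCKSIZE);
    }
}

// out(x, y) = in(y, x), one tile at a time.
void transpose_morton(const std::vector<float> &in, std::vector<float> &out,
                      std::size_t n) {
    for_each_tile_morton(n, [&](std::size_t bx, std::size_t by) {
        for (std::size_t y = by; y < by + BLOCKSIZE; ++y)
            for (std::size_t x = bx; x < bx + BLOCKSIZE; ++x)
                out[y * n + x] = in[x * n + y];
    });
}

// out(x, y) = sum over t of lhs(x, t) * rhs(t, y).
// The sum is kept in a local variable and stored to out only once.
void multiply_morton(const std::vector<float> &lhs, const std::vector<float> &rhs,
                     std::vector<float> &out, std::size_t n) {
    for_each_tile_morton(n, [&](std::size_t bx, std::size_t by) {
        for (std::size_t y = by; y < by + BLOCKSIZE; ++y)
            for (std::size_t x = bx; x < bx + BLOCKSIZE; ++x) {
                float acc = 0.0f;  // running sum held in a temporary
                for (std::size_t t = 0; t < n; ++t)
                    acc += lhs[t * n + x] * rhs[y * n + t];
                out[y * n + x] = acc;  // single write of the final result
            }
    });
}
```

Accumulating into `acc` avoids a read-modify-write of `out(x, y)` on every iteration of the inner loop, and the Z-shaped tile order keeps consecutively visited tiles close together in both x and y.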