
cpu: pooling: fix crashes of large tensor processing #2875

Open: wants to merge 1 commit into main
Conversation

asimonov1 (Contributor)
This PR fixes crashes of simple_nchw and simple_nhwc in some of the cases found in MFDNN-13286.

It also reduces memory allocation in simple_nchw for backward propagation with non-f32 data types: simple_nchw requested scratchpad memory for every available thread even when the number of work items was smaller (e.g., it requested more than 4 TB for mb1ic1iw4294967311ow858993461kw7sw5pw0, according to the logs in MFDNN-13286; the size depends on the number of available threads). This helps in some cases, particularly those from MFDNN-13286.

@asimonov1 asimonov1 requested a review from a team as a code owner March 13, 2025 12:21
@asimonov1 asimonov1 requested a review from a team March 13, 2025 12:22
asimonov1 (Contributor, Author)
make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_pool

Review comment on this diff context:

```cpp
        nstl::min(C_per_thr, max_block_size / data_size_per_ch),
        (dim_t)1);

void calculate_nthr_and_channel_block_size() {
    nthr_ = dnnl_get_max_threads();
```
Contributor
nthr_ is supposed to always equal the upper-level value the user sets. The reason is the following:

The parallel function creates a pool with the specified number of threads. When that number changes (consider a convolution before this pooling that used all threads), the underlying OMP/TBB implementation re-creates the global pool object at runtime, which is expensive; the next op might then use the full number of threads again and trigger yet another re-creation, paying the price twice.

The allowed scenarios are limited to either passing nthr=1 instead of nthr_ from pd to the parallel call, which causes a sequential run and avoids creating the underlying parallel section altogether, or keeping the full thread count and, inside the parallel call, having the tail threads/workers drop their work with a condition.

Spawning extra idle threads this way has far less overhead than re-creating the singleton, but it requires careful alignment between the parallelization logic and the blocking/balancing logic.
