
cpu: pooling: fix crashes of large tensor processing #2875

Open: wants to merge 1 commit into main
Conversation

asimonov1 (Contributor)
This PR fixes crashes of simple_nchw and simple_nhwc in some of the cases found in MFDNN-13286.

It also reduces memory allocation in simple_nchw for backward propagation with non-f32 data types: simple_nchw requested scratchpad memory for every available thread even when the number of work items was smaller (e.g., it requested more than 4 TB for mb1ic1iw4294967311ow858993461kw7sw5pw0, according to the logs in MFDNN-13286; the size depends on the number of available threads). This helps in some cases, particularly those from MFDNN-13286.

@asimonov1 asimonov1 requested a review from a team as a code owner March 13, 2025 12:21
@asimonov1 asimonov1 requested a review from a team March 13, 2025 12:22
asimonov1 (Contributor, Author)
make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_pool

Review comment on this diff context:

```cpp
        nstl::min(C_per_thr, max_block_size / data_size_per_ch),
        (dim_t)1);

void calculate_nthr_and_channel_block_size() {
    nthr_ = dnnl_get_max_threads();
```
Contributor
nthr_ is supposed to always equal the upper-level value the user sets. The reason is the following:

The parallel function creates a pool with the specified number of threads. When that number changes (consider a convolution before this pooling that used all threads), the underlying OMP/TBB implementation re-creates the global pool object at runtime, which is expensive; the next op might then use the full number of threads again and trigger yet another re-creation, paying the price twice.

The allowed scenarios are limited to either passing nthr=1 instead of nthr_ from pd to the parallel call, which causes a sequential run and avoids creating the underlying parallel section altogether, or keeping the full thread count and, inside the parallel call, having the tail threads/workers drop their work with a condition.

Spawning extra idle threads this way has far less overhead than re-creating the singleton, but it requires careful alignment between the parallelization logic and the blocking/balancing logic.
