
Improve performance of dnp.nan_to_num #2228

Merged

Conversation

ndgrigorian (Collaborator) commented Dec 10, 2024

This PR adds a dedicated kernel for dnp.nan_to_num to improve its performance, reducing the number of kernel calls to at most one in all cases.

Kernels for both strided and contiguous inputs have been added, to avoid additional allocation of device memory for trivial strides when the input is fully C- or F-contiguous.
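
For reference, the replacements a single fused nan_to_num kernel has to perform can be sketched in NumPy terms. This is an illustrative reference implementation for floating-point input (`nan_to_num_ref` is a hypothetical name used here for clarity), not the dpnp kernel itself:

```python
# Illustrative sketch of nan_to_num semantics for floating-point input,
# written against NumPy; NOT the dpnp SYCL kernel, just the replacements
# the fused kernel performs in a single pass.
import numpy as np

def nan_to_num_ref(x, nan=0.0, posinf=None, neginf=None):
    # NaN -> `nan`, +inf -> dtype max (or `posinf`), -inf -> dtype min (or `neginf`)
    x = np.asarray(x)
    info = np.finfo(x.dtype)
    posinf = info.max if posinf is None else posinf
    neginf = info.min if neginf is None else neginf
    out = np.where(np.isnan(x), x.dtype.type(nan), x)
    out = np.where(np.isposinf(x), posinf, out)
    out = np.where(np.isneginf(x), neginf, out)
    return out

x = np.array([np.nan, np.inf, -np.inf, 1.5])
r = nan_to_num_ref(x)
# r == [0.0, float64 max, float64 min, 1.5]
```

Doing all three replacements in one kernel is what lets the PR avoid the separate kernel launches a composed implementation would need.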

As an example of the performance gains, measured on a Max GPU:

master:

```python
In [1]: import dpnp as dnp

In [2]: import numpy as np

In [3]: x_np = np.random.randn(10**9)

In [4]: x_np[np.random.choice(x_np.size, 200, replace=False)] = np.nan

In [5]: x = dnp.asarray(x_np)

In [6]: q = x.sycl_queue

In [7]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 394 ms, sys: 43.8 ms, total: 438 ms
Wall time: 304 ms

In [8]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 333 ms, sys: 31.8 ms, total: 364 ms
Wall time: 134 ms
```

on branch:

```python
In [8]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 49.6 ms, sys: 8.1 ms, total: 57.7 ms
Wall time: 60.9 ms

In [9]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 22.9 ms, sys: 16 ms, total: 38.9 ms
Wall time: 19.7 ms
```
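
The measurements above can also be reproduced as a plain script. A minimal sketch of the same timing pattern, with np.nan_to_num standing in for dnp.nan_to_num so it runs without a SYCL device (`bench_nan_to_num` and the `sync` parameter are hypothetical helpers, not part of dpnp):

```python
# Hypothetical standalone harness mirroring the %time + q.wait() pattern;
# np.nan_to_num stands in for dnp.nan_to_num. On dpnp, pass
# sync=x.sycl_queue.wait so the measurement includes device execution time.
import time
import numpy as np

def bench_nan_to_num(x, repeat=3, sync=lambda: None):
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        r = np.nan_to_num(x)
        sync()  # block until any asynchronous kernel actually finishes
        best = min(best, time.perf_counter() - t0)
    return best, r

x = np.random.randn(10**6)
x[np.random.choice(x.size, 200, replace=False)] = np.nan
elapsed, r = bench_nan_to_num(x)
print(f"best of {3}: {elapsed * 1e3:.2f} ms")
```

The explicit synchronization matters: without it, the timing would only cover kernel submission, not execution, which is why the PR's measurements call q.wait() inside the timed expression.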
  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you filing the PR as a draft?

ndgrigorian force-pushed the improve-nan-to-num-performance branch from 48f623b to 0693c0b on December 26, 2024 at 18:39
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 0693c0b to 50c28f7 on January 12, 2025 at 02:13
coveralls (Collaborator) commented Jan 12, 2025

Coverage Status

coverage: 71.572% (-0.3%) from 71.848% when pulling 3acc9c4 on ndgrigorian:improve-nan-to-num-performance into 5b140db on IntelPython:master.

ndgrigorian force-pushed the improve-nan-to-num-performance branch 2 times, most recently from 0c6d3f8 to 54dfaf5 on January 28, 2025 at 19:16
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 5d19afb to d1fb595 on February 2, 2025 at 22:36
ndgrigorian force-pushed the improve-nan-to-num-performance branch 3 times, most recently from 1ac1288 to 3c66551 on February 4, 2025 at 21:36
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 3c66551 to 99fd28b on February 5, 2025 at 18:35
antonwolfy (Contributor) left a comment

Thank you @ndgrigorian, no more comments from me

ndgrigorian (Collaborator, Author) replied:

> Thank you @ndgrigorian, no more comments from me

Great, feel free to merge at your convenience

ndgrigorian (Collaborator, Author):

Also, for posterity, all tests pass on CUDA.

antonwolfy merged commit 77702b3 into IntelPython:master on Feb 5, 2025
63 of 70 checks passed
github-actions bot added a commit that referenced this pull request Feb 5, 2025