
Improve performance of dnp.nan_to_num #2228

Merged

Conversation

ndgrigorian (Collaborator) commented Dec 10, 2024

This PR adds a dedicated kernel for dnp.nan_to_num to improve its performance, reducing the number of kernel calls to at most one in all cases.

Kernels for both strided and contiguous inputs have been added, to avoid additional allocation of device memory for trivial strides when the input is fully C- or F-contiguous.
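
For reference, the replacements a single fused nan_to_num kernel has to perform can be sketched in NumPy terms. This is an illustrative reference implementation for floating-point input (`nan_to_num_ref` is a hypothetical name used here for clarity), not the dpnp kernel itself:

```python
# Illustrative sketch of nan_to_num semantics for floating-point input,
# written against NumPy; NOT the dpnp SYCL kernel, just the replacements
# the fused kernel performs in a single pass.
import numpy as np

def nan_to_num_ref(x, nan=0.0, posinf=None, neginf=None):
    # NaN -> `nan`, +inf -> dtype max (or `posinf`), -inf -> dtype min (or `neginf`)
    x = np.asarray(x)
    info = np.finfo(x.dtype)
    posinf = info.max if posinf is None else posinf
    neginf = info.min if neginf is None else neginf
    out = np.where(np.isnan(x), x.dtype.type(nan), x)
    out = np.where(np.isposinf(x), posinf, out)
    out = np.where(np.isneginf(x), neginf, out)
    return out

x = np.array([np.nan, np.inf, -np.inf, 1.5])
r = nan_to_num_ref(x)
# r == [0.0, float64 max, float64 min, 1.5]
```

Doing all three replacements in one kernel is what lets the PR avoid the separate kernel launches a composed implementation would need.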

As an example of the performance gains, measured on a Max GPU:

master:

```python
In [1]: import dpnp as dnp

In [2]: import numpy as np

In [3]: x_np = np.random.randn(10**9)

In [4]: x_np[np.random.choice(x_np.size, 200, replace=False)] = np.nan

In [5]: x = dnp.asarray(x_np)

In [6]: q = x.sycl_queue

In [7]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 394 ms, sys: 43.8 ms, total: 438 ms
Wall time: 304 ms

In [8]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 333 ms, sys: 31.8 ms, total: 364 ms
Wall time: 134 ms
```

on branch:

```python
In [8]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 49.6 ms, sys: 8.1 ms, total: 57.7 ms
Wall time: 60.9 ms

In [9]: %time r = dnp.nan_to_num(x); q.wait()
CPU times: user 22.9 ms, sys: 16 ms, total: 38.9 ms
Wall time: 19.7 ms
```
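
The measurements above can also be reproduced as a plain script. A minimal sketch of the same timing pattern, with np.nan_to_num standing in for dnp.nan_to_num so it runs without a SYCL device (`bench_nan_to_num` and the `sync` parameter are hypothetical helpers, not part of dpnp):

```python
# Hypothetical standalone harness mirroring the %time + q.wait() pattern;
# np.nan_to_num stands in for dnp.nan_to_num. On dpnp, pass
# sync=x.sycl_queue.wait so the measurement includes device execution time.
import time
import numpy as np

def bench_nan_to_num(x, repeat=3, sync=lambda: None):
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        r = np.nan_to_num(x)
        sync()  # block until any asynchronous kernel actually finishes
        best = min(best, time.perf_counter() - t0)
    return best, r

x = np.random.randn(10**6)
x[np.random.choice(x.size, 200, replace=False)] = np.nan
elapsed, r = bench_nan_to_num(x)
print(f"best of {3}: {elapsed * 1e3:.2f} ms")
```

The explicit synchronization matters: without it, the timing would only cover kernel submission, not execution, which is why the PR's measurements call q.wait() inside the timed expression.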
  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you filing the PR as a draft?

ndgrigorian force-pushed the improve-nan-to-num-performance branch from 48f623b to 0693c0b on December 26, 2024 at 18:39
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 0693c0b to 50c28f7 on January 12, 2025 at 02:13
coveralls (Collaborator) commented Jan 12, 2025

Coverage Status

coverage: 71.572% (-0.3%) from 71.848% when pulling 3acc9c4 on ndgrigorian:improve-nan-to-num-performance into 5b140db on IntelPython:master.

ndgrigorian force-pushed the improve-nan-to-num-performance branch 2 times, most recently from 0c6d3f8 to 54dfaf5 on January 28, 2025 at 19:16
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 5d19afb to d1fb595 on February 2, 2025 at 22:36
ndgrigorian force-pushed the improve-nan-to-num-performance branch 3 times, most recently from 1ac1288 to 3c66551 on February 4, 2025 at 21:36
ndgrigorian force-pushed the improve-nan-to-num-performance branch from 3c66551 to 99fd28b on February 5, 2025 at 18:35
antonwolfy (Contributor) left a comment

Thank you @ndgrigorian, no more comments from me

ndgrigorian (Collaborator, Author) replied:

> Thank you @ndgrigorian, no more comments from me

Great, feel free to merge at your convenience

ndgrigorian (Collaborator, Author):

Also, for posterity, all tests pass on CUDA.

antonwolfy merged commit 77702b3 into IntelPython:master on Feb 5, 2025
63 of 70 checks passed
github-actions bot added a commit that referenced this pull request Feb 5, 2025