Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: unique* functions in lazy libraries: add size=None, fill_value=None #883

Open
crusaderky opened this issue Jan 13, 2025 · 4 comments
Open
Labels
Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes.

Comments

@crusaderky
Copy link

crusaderky commented Jan 13, 2025

The functions

currently come with the disclaimer

The shapes of two of the output arrays for this function depend on the data values in the input array; hence, array libraries which build computation graphs (e.g., JAX, Dask, etc.) may find this function difficult to implement without knowing array values. Accordingly, such libraries may choose to omit this function. See Data-dependent output shapes section for more details.

I propose to

  • Add two optional parameters size=None, fill_value=None, lifted from JAX
  • Change the disclaimer:
- Accordingly, such libraries may choose to omit this function.
+ Accordingly, such libraries may choose to require the size= parameter.

This not only helps with JAX and ndonnx interoperability, but also with Dask as it would allow to avoid NaN-sized arrays.

@cbourjau
Copy link
Contributor

Please note that while ONNX models may profit from statically known dimension lengths, it is not a necessity. The only thing that is required from ndonnx's perspective is that operations produce arrays with statically known rank. That said, we could support the proposed API in ndonnx, too.

@kgryte kgryte added the RFC Request for comments. Feature requests and proposed changes. label Jan 15, 2025
@kgryte
Copy link
Contributor

kgryte commented Jan 20, 2025

@crusaderky Thank you for opening this RFC.

Question: while size and fill_value are useful for JIT'd libraries, such as JAX, what would be the benefit/application for a library such as NumPy implementing these kwargs? It seems like these wouldn't be useful or applicable when using, e.g., NumPy.

If we were to standardize, we'd want to ensure broad applicability across array libraries, JIT'd or not.

@kgryte kgryte added the Needs Discussion Needs further discussion. label Jan 20, 2025
@crusaderky
Copy link
Author

crusaderky commented Jan 20, 2025

Question: while size and fill_value are useful for JIT'd libraries, such as JAX, what would be the benefit/application for a library such as NumPy implementing these kwargs? It seems like these wouldn't be useful or applicable when using, e.g., NumPy.

It would be useful to speed up "fail early" and fast-path tests, or in general whenever you don't care beyond a certain unique count. e.g. you could change

if xp.unique_values(arr).size < 10:
   fast_path(arr)
else:
   slow_path(arr)

to

_, counts = xp.unique_counts(arr, size=10)
if counts[-1] != 0:  # <10 unique elements
   fast_path(arr)
else:
   slow_path(arr)    

The runtime has changed from O(n), where n=arr.size, to O(1~n) depending on the entropy level of arr; O(1) when contents order is uniformly randomized.
If size <<arr.size this can be a OOM speedup.

@crusaderky
Copy link
Author

crusaderky commented Jan 20, 2025

Another important use case: on Dask, this is a reduction equivalent to topk, where a naive implementation necessarily needs to fit the intermediate and final output on a single chunk, and a non-naive one without this restriction is much more complicated and slower (it needs to be built on top of Dask's shuffle infrastructure).

So having a size= parameter on dask could make the difference between an algorithm that can ingest any data in a predictable amount of time and one that can cause KilledWorker worker deaths due to OOM depending on what data it ingests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes.
Projects
None yet
Development

No branches or pull requests

3 participants