Base R function for length(unique(x)) #149

Open
hughjonesd opened this issue Jul 22, 2023 · 4 comments

Comments

@hughjonesd

It's common to need to know how many unique elements there are in a vector. This often happens on the REPL during exploration of data. A single function like

nunique <- function (x) length(unique(x))

could save time and help readability.
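A quick illustration of the proposed helper, as a minimal sketch: `nunique` is the name suggested above and is not part of base R, and the sample data is made up. Note that `unique()` keeps `NA` as a value, so it counts toward the total.

```r
# Helper proposed in this issue; not part of base R.
nunique <- function(x) length(unique(x))

# Made-up sample data with repeated and missing values.
x <- c("a", "b", "a", NA, "b", NA)

n_all <- nunique(x)              # counts "a", "b" and NA
n_obs <- nunique(x[!is.na(x)])   # counts only the non-missing values

print(c(all = n_all, non_missing = n_obs))
```

Whether `NA` should count as a distinct value is a design choice that a base R version would have to settle.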

@HenrikBengtsson
Owner

> could save time and help readability.

I can see how it can help readability, but what are your thoughts on how to increase the performance? My understanding is that one still has to allocate memory for and keep track of all unique values, which is what unique() does. I don't think that step can be avoided, but I might be wrong.

@karoliskoncevicius

karoliskoncevicius commented Feb 14, 2025

Faster performance should be possible for factors, which keep track of their possible levels. One solution might be to make nlevels() generic and have it work on character vectors too. That would avoid introducing a new function, though the name is not so intuitive.
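One caveat worth keeping in mind, shown here as a small sketch with made-up data: for a factor, `nlevels()` counts the declared levels, which is not always the number of values that actually occur, so it would only match `length(unique(x))` when there are no unused levels.

```r
# Made-up factor with one declared but unused level ("c").
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

n_levels   <- nlevels(f)              # declared levels, used or not
n_observed <- length(unique(f))       # values that actually occur
n_dropped  <- nlevels(droplevels(f))  # levels left after dropping unused ones

print(c(levels = n_levels, observed = n_observed, dropped = n_dropped))
```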

But in general yes, I have often wished for a function that does length(unique()), especially within constructs like Map(), where instead of writing Map(nunique, list) I need to write Map(function(x) length(unique(x)), list).
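The two spellings side by side, as a minimal sketch: `nunique` is the helper proposed in this issue (not base R), and `cols` is a made-up list used only for illustration.

```r
# Helper proposed in this issue; not part of base R.
nunique <- function(x) length(unique(x))

# Made-up list of vectors standing in for the `list` mentioned above.
cols <- list(id = c(1, 2, 2, 3), grp = c("x", "y", "x", "x"))

with_helper <- Map(nunique, cols)
with_lambda <- Map(function(x) length(unique(x)), cols)

# Both calls give the same named list of counts.
same <- identical(with_helper, with_lambda)

# vapply() returns the counts as a named integer vector instead of a list.
counts <- vapply(cols, nunique, integer(1))
print(counts)
```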

@TimTaylor

> I can see how it can help readability, but what are your thoughts on how to increase the performance? My understanding is that one still has to allocate memory for and keep track of all unique values, which is what unique() does. I don't think that step can be avoided, but I might be wrong.

I think your only option would be to go to the C code. I think it would mean adding a fourth op at:
https://github.com/wch/r-source/blob/4f07a781cb5f10cb297e991fa1139431f2cd053f/src/main/unique.c#L1097

and then returning early at

https://github.com/wch/r-source/blob/4f07a781cb5f10cb297e991fa1139431f2cd053f/src/main/unique.c#L1162

This would avoid the additional allocation for unique and doesn't seem too invasive (that said, I've no idea how involved the first (op) part is/would be ...)
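For comparison, an R-level sketch of the same idea for atomic vectors: counting the elements that `duplicated()` does not flag gives the same number without building the vector of unique values. This is only an illustration, not the C change described above. The name `nunique_dup` is hypothetical, and because `duplicated()` still allocates a logical vector as long as its input, any speed or memory benefit is an assumption that would need benchmarking.

```r
# Hypothetical helper: count first occurrences instead of materialising
# the unique values. Intended for atomic vectors only.
nunique_dup <- function(x) sum(!duplicated(x))

set.seed(1)
x <- sample(letters, 1000, replace = TRUE)  # made-up test data

# Same answer as the length(unique(x)) idiom.
agree <- nunique_dup(x) == length(unique(x))
print(agree)
```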

@HenrikBengtsson
Owner

Thanks both. Awesome. Sounds like there's indeed also room for performance improvements, by avoiding that memory allocation and copying at the end.

> Faster performance should be possible for factors, which keep track of their possible levels.

It sounds like unique() could also benefit from this. That one is definitely a great candidate for R Dev Days, because it doesn't change any API.

Introducing nunique() should probably be brought up on R-devel first, using the arguments you provided here.
