Base R function for length(unique(x)) #149

hughjonesd · 2023-07-22T10:29:01Z

It's common to need to know how many unique elements there are in a vector. This often happens on the REPL during exploration of data. A single function like

nunique <- function (x) length(unique(x))

could save time and help readability.
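As a quick illustration of the proposed helper, here is a minimal sketch. The name `nunique` comes from the proposal above and is not an existing base R function; the sample vector is made up for the example.

```r
# Hypothetical helper from the proposal; not part of base R.
nunique <- function(x) length(unique(x))

# Made-up example data with a repeated value and repeated NAs.
x <- c("a", "b", "a", NA, NA)

# unique() keeps one copy of each distinct value, with NA treated as a value,
# so this counts "a", "b" and NA.
nunique(x)
```

Note that, because it is built on unique(), the helper counts NA as one distinct value.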


HenrikBengtsson · 2025-02-14T18:51:07Z

> could save time and help readability.

I can see how it can help readability, but what are your thoughts on how to improve performance? My understanding is that one still has to allocate memory for and keep track of all unique values, which is what unique() does. I don't think that step can be avoided, but I might be wrong.

karoliskoncevicius · 2025-02-14T19:42:15Z

Performance should be faster for factors, which keep track of their possible levels. One solution might be to make nlevels() generic so that it works on character vectors too. That would avoid introducing a new function, though the name is not so intuitive.
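One caveat with the nlevels() idea, sketched below with a made-up factor: nlevels() counts declared levels, including unused ones, so it is not always the same as length(unique(x)). This is only an illustration of current base R behaviour, not a statement about how a generic nlevels() would be designed.

```r
# Made-up factor with one declared but unused level ("c").
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

nlevels(f)              # counts all declared levels, including unused "c"
length(unique(f))       # counts only the values actually present
nlevels(droplevels(f))  # dropping unused levels makes the two agree here
```

So a levels-based shortcut would match length(unique(x)) only when every level is actually used (and NA handling would also need a decision).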

But in general yes, many times I have also wished for a function that does length(unique()), especially within constructs like Map(), where instead of writing Map(nunique, list) I need to write Map(function(x) length(unique(x)), list).
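The readability difference in the Map() case can be sketched as follows. `nunique` is the hypothetical helper from the proposal and `lst` is made-up data; the vapply() line is an extra variant, added here only to show a typed result.

```r
# Hypothetical helper from the proposal; not part of base R.
nunique <- function(x) length(unique(x))

# Made-up list of vectors.
lst <- list(a = c(1, 2, 2), b = c("x", "y", "z", "x"), c = integer(0))

# What has to be written today:
res_anon <- Map(function(x) length(unique(x)), lst)

# What the helper would allow:
res_named <- Map(nunique, lst)

# Same counts as a named integer vector instead of a list:
res_vec <- vapply(lst, nunique, integer(1))
```

Both Map() calls return the same list; only the call site gets shorter.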

TimTaylor · 2025-02-14T20:18:41Z

> I can see how it can help readability, but what are your thoughts on how to improve performance? My understanding is that one still has to allocate memory for and keep track of all unique values, which is what unique() does. I don't think that step can be avoided, but I might be wrong.

I think your only option would be to go to the C code. I think it would mean adding a fourth op at:
https://github.com/wch/r-source/blob/4f07a781cb5f10cb297e991fa1139431f2cd053f/src/main/unique.c#L1097

and then returning early at

https://github.com/wch/r-source/blob/4f07a781cb5f10cb297e991fa1139431f2cd053f/src/main/unique.c#L1162

This would avoid the additional allocation for unique and doesn't seem too invasive (that said, I've no idea how involved the first (op) part is/would be ...)
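For intuition at the R level, the count can already be obtained without building the vector of unique values by summing over duplicated(). This is a different technique from the C change suggested above, not an implementation of it: it still allocates a logical vector of length(x), so it does not demonstrate the memory saving. The data below is made up.

```r
set.seed(1)
# Made-up data: 100,000 draws from 1,000 possible integer values.
x <- sample(1000L, 1e5, replace = TRUE)

# Current idiom: materialise the unique values, then count them.
n_via_unique <- length(unique(x))

# Count-only variant: mark first occurrences and sum them.
n_via_duplicated <- sum(!duplicated(x))

identical(n_via_unique, n_via_duplicated)
```

A dedicated count-only path in C would presumably skip both the unique-values vector and the logical vector, but that is an assumption about an implementation that does not exist yet.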

HenrikBengtsson · 2025-02-14T20:35:07Z

Thanks both. Awesome. Sounds like there's indeed also room for performance improvements, by avoiding that memory allocation and copying at the end.

> Performance should be faster for factors, which keep track of their possible levels.

It sounds like unique() could also benefit from this. That one is definitely a great candidate for R Dev Days, because it's not changing any API.

Introducing nunique() should probably be brought up on R-devel first, using the arguments you provided here.

HenrikBengtsson added the r-dev-day-candidate label Feb 14, 2025
