Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

Closed
HenrikBengtsson opened this issue Dec 4, 2020 · 12 comments
Closed
Milestone

Comments

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Dec 4, 2020

In const-ae/sparseMatrixStats#13 (comment) @hpages argues for:

rowVars(X, center=0, na.rm=TRUE)

to stay with the rationale that it provides an efficent way to calculate:

rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)

BTW, the latter can be improved further; see below for details.

Issue

In the next release, I've decided to tighten up what center can take, specifically, it must be of the same length as the number of rows in X, cf. #187 (comment). Using a scalar, as here, will result in a deprecation warning, e.g.

Warning message:
Argument 'center' should be of the same length as number of rows of 'x'.
Use of a scalar value is deprecated: 1 != 3 

Now, although unclear, the purpose of the argument center was always meant to provide a way to avoid having to re-estimate center if already estimated, e.g.

mean_and_var <- function(X) {
  mean <- rowMeans(X)
  var <- rowVars(X, center = mean)
  list(mean = mu, var = var)
}

Using rowVars(X, center = 0) as an efficient version of

rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)

is more of a side-effect rather than being the intended usage. Point is, this use of center is different from the original use of center and there's always a risk of mistakes, misunderstandings, and bugs when something has more than one purpose.

For the above reasons, unless someone can convince me otherwise, my plan is to go ahead a deprecate scalar values on center as is center = 0.

BTW, as it stands now, with the next release, using:

rowVars(X, center = 0)

will still work but to get rid of the above warning one needs to do:

rowVars(X, center = rep(0, times = nrow(X)))

to achieve the same thing. From the below benchmarking, this approach is ~25% faster compared to the version that does not use a center argument.

Benchmarking

We have:

f1 <- function(X) {
  rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)
}

but a more efficent implementation of this is:

f2 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (rowSums(!is.na(X)) - 1L)
}

Now, rowSums(!is.na(X)) can be calculated as:

ncol(X) - rowSums(is.na(X))

This avoids the negation !is.na(X) of a full matrix;

f3 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (ncol(X) - rowSums(is.na(X)) - 1L)
}

We can also avoid the is.na(X) "coercion" by using rowCounts() as in:

ncol(X) - rowCounts(X, value = NA_real_)

Thus, a more efficent version of the above is:

f4 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (ncol(X) - 1L - rowCounts(X, value = NA_real_))
}
> f0 <- function(X) rowVars(X, center = rep(0, times = nrow(X)))
> bench::mark(f0(X), f1(X), f2(X), f3(X), f4(X))[,1:9]
# A tibble: 5 x 9
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f0(X)        24.5µs   28.5µs    31883.    52.7KB     31.9  9990    10      313ms
2 f1(X)        69.7µs   75.9µs    12894.   101.5KB     23.5  6046    11      469ms
3 f2(X)        63.1µs   70.4µs    13995.   101.5KB     28.1  6477    13      463ms
4 f3(X)        57.9µs   61.8µs    15637.    76.4KB     23.6  7304    11      467ms
5 f4(X)        34.5µs   37.9µs    25106.    51.1KB     22.6  9991     9      398ms

Comment: When specifying center, rowVars() is not fully using native code, i.e. f0() would be faster if/when we re-implement center in native code.

@hpages
Copy link

hpages commented Dec 4, 2020

When specifying center, rowVars() is not fully using native code.

I'll just reiterate that this is really at the center of this entire discussion. Again, all the problems, misunderstandings, frustration, and sadness around the current handling of center are only side effects of the unfortunate situation that rowVars(..., center=my_center) is not using the same path as when center is not specified.

@HenrikBengtsson
Copy link
Owner Author

Hmm... that's just a difference in performance - I'm after the behavior here.

@hpages
Copy link

hpages commented Dec 4, 2020

Don't know how it is now but last time I checked (a couple of weeks ago) it was both performance and behavior. The 2 paths were not using the same formula. My point is that if rowVars() had used a single path and single formula, whether center is specified or not, a lot of the discussions about what to keep or drop, benchmarks, surprising behaviors, etc... would become moot.

@HenrikBengtsson
Copy link
Owner Author

But only if the native code had implemented the primary formula and not the alternative one. My point here is, let's not look what's going on inside but what the API should provide and what role center should have.

@hpages
Copy link

hpages commented Dec 4, 2020

But once the inside is fixed, the behavior becomes healthy and there is no need a priori to try to artificially block the use of arbitrary centers. It's counter-productive.

@HenrikBengtsson
Copy link
Owner Author

Ok, let's simplify the discussion so we can get to the same page. You're the first one I've heard arguing for using center = 0. Can we both agree that that is a very special case that only applies to (L2) sample variance estimation? More specifically, can we agree that using:

  • length(center) == 1 && center != 0

is a mistake and we should produce an error when that is used?

@hpages
Copy link

hpages commented Dec 5, 2020

I don't see this as a mistake. If that's what the user wants to do then so be it.

The only thing that matters is that the user knows what the function is doing with center. Right now they don't, which again, is at the root of the problem.

So to me this is a very simple situation where you just need to say that rowVars(x, center) does:

sum((x[i, ] - center[i])^2) / (length(x[i, ]) - 1)

for each row, and that center will be set to center <- rowMeans(x) by default. That's really all there is to it. Yes, it's all about clarifying the semantic. Then the user can decide what they do with center. I don't think that trying to detect the valid or invalid uses of center is that useful or productive.

@HenrikBengtsson
Copy link
Owner Author

Mkay... Let's move on. For me to understand your center = 0 case - is:

sum((x - center)^2) / (length(x) - 1)

used to estimate a variance? If not, in what use case does it come up for you?

@hpages
Copy link

hpages commented Dec 5, 2020

My use case is not important. If I can't use rowVars(x, center=0) for it, I'll use something else. No worries.

Thanks

@HenrikBengtsson
Copy link
Owner Author

But, it's useful for me to decide what to do here. If rowVars(x, center = 0) is not used to estimate the variance, then it's a clear case for me - whatever the calculation is done should not be done by rowVars() because that exists in order to estimate the variance. Any other use is basically just surfing on side effects from how it's calculated (which is something that was never clearly defined, hence Issue #187).

@hpages
Copy link

hpages commented Dec 5, 2020

My use case is to support scale() on DelayedMatrix objects: https://github.com/Bioconductor/DelayedArray/blob/aa9a74556f5c36307ceb638eb31199477139fea8/R/DelayedArray-utils.R#L857

Yes, it's fully understood that when center is not set to rowMeans() then what rowVars() returns should not longer be considered the row variances, in the usual sense. More something like "centered row variances". It doesn't really need a name, as long as it's well defined and coincides with the row variances by default.

@HenrikBengtsson HenrikBengtsson changed the title rowVars(..., center=0): Should I stay or should I go? (mostly like go) rowVars(..., center=0): Should I stay or should I go? (most likely go) Dec 5, 2020
@HenrikBengtsson
Copy link
Owner Author

Sorry for the delay here. I didn't consider scale() and that it supports pre-estimated center values, which by the way opens up to the same issues we've had when an incorrect, or a biases center is given in rowVars() and friends.

Looking at scale(..., center), we find that it also does not support scalar center values and it requires that it's of the correct length, e.g.

> X <- matrix(1:6, nrow = 2)
> scale(X, center = 0)
Error in scale.default(X, center = 0) : 
  length of 'center' must equal the number of columns of 'x'

In order to move forward now and later on (e.g. implement support for center in native code), and to minimize the risk for user mistakes, I've decided to move forward and deprecate support for scalar center. Hoping to submit the release asap (= after running 300+ revdep checks).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants