rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

HenrikBengtsson · 2020-12-04T21:09:16Z

In const-ae/sparseMatrixStats#13 (comment) @hpages argues for:

rowVars(X, center=0, na.rm=TRUE)

to stay with the rationale that it provides an efficent way to calculate:

rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)

BTW, the latter can be improved further; see below for details.

Issue

In the next release, I've decided to tighten up what center can take, specifically, it must be of the same length as the number of rows in X, cf. #187 (comment). Using a scalar, as here, will result in a deprecation warning, e.g.

Warning message:
Argument 'center' should be of the same length as number of rows of 'x'.
Use of a scalar value is deprecated: 1 != 3

Now, although unclear, the purpose of the argument center was always meant to provide a way to avoid having to re-estimate center if already estimated, e.g.

mean_and_var <- function(X) {
  mean <- rowMeans(X)
  var <- rowVars(X, center = mean)
  list(mean = mu, var = var)
}

Using rowVars(X, center = 0) as an efficient version of

rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)

is more of a side-effect rather than being the intended usage. Point is, this use of center is different from the original use of center and there's always a risk of mistakes, misunderstandings, and bugs when something has more than one purpose.

For the above reasons, unless someone can convince me otherwise, my plan is to go ahead a deprecate scalar values on center as is center = 0.

BTW, as it stands now, with the next release, using:

rowVars(X, center = 0)

will still work but to get rid of the above warning one needs to do:

rowVars(X, center = rep(0, times = nrow(X)))

to achieve the same thing. From the below benchmarking, this approach is ~25% faster compared to the version that does not use a center argument.

Benchmarking

We have:

f1 <- function(X) {
  rowSums(X^2 / (rowSums(!is.na(X))-1L), na.rm=TRUE)
}

but a more efficent implementation of this is:

f2 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (rowSums(!is.na(X)) - 1L)
}

Now, rowSums(!is.na(X)) can be calculated as:

ncol(X) - rowSums(is.na(X))

This avoids the negation !is.na(X) of a full matrix;

f3 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (ncol(X) - rowSums(is.na(X)) - 1L)
}

We can also avoid the is.na(X) "coercion" by using rowCounts() as in:

ncol(X) - rowCounts(X, value = NA_real_)

Thus, a more efficent version of the above is:

f4 <- function(X) {
  rowSums(X^2, na.rm = TRUE) / (ncol(X) - 1L - rowCounts(X, value = NA_real_))
}

> f0 <- function(X) rowVars(X, center = rep(0, times = nrow(X)))
> bench::mark(f0(X), f1(X), f2(X), f3(X), f4(X))[,1:9]
# A tibble: 5 x 9
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f0(X)        24.5µs   28.5µs    31883.    52.7KB     31.9  9990    10      313ms
2 f1(X)        69.7µs   75.9µs    12894.   101.5KB     23.5  6046    11      469ms
3 f2(X)        63.1µs   70.4µs    13995.   101.5KB     28.1  6477    13      463ms
4 f3(X)        57.9µs   61.8µs    15637.    76.4KB     23.6  7304    11      467ms
5 f4(X)        34.5µs   37.9µs    25106.    51.1KB     22.6  9991     9      398ms

Comment: When specifying center, rowVars() is not fully using native code, i.e. f0() would be faster if/when we re-implement center in native code.

The text was updated successfully, but these errors were encountered:

hpages · 2020-12-04T23:00:43Z

When specifying center, rowVars() is not fully using native code.

I'll just reiterate that this is really at the center of this entire discussion. Again, all the problems, misunderstandings, frustration, and sadness around the current handling of center are only side effects of the unfortunate situation that rowVars(..., center=my_center) is not using the same path as when center is not specified.

HenrikBengtsson · 2020-12-04T23:03:12Z

Hmm... that's just a difference in performance - I'm after the behavior here.

hpages · 2020-12-04T23:11:20Z

Don't know how it is now but last time I checked (a couple of weeks ago) it was both performance and behavior. The 2 paths were not using the same formula. My point is that if rowVars() had used a single path and single formula, whether center is specified or not, a lot of the discussions about what to keep or drop, benchmarks, surprising behaviors, etc... would become moot.

HenrikBengtsson · 2020-12-04T23:15:27Z

But only if the native code had implemented the primary formula and not the alternative one. My point here is, let's not look what's going on inside but what the API should provide and what role center should have.

hpages · 2020-12-04T23:23:26Z

But once the inside is fixed, the behavior becomes healthy and there is no need a priori to try to artificially block the use of arbitrary centers. It's counter-productive.

HenrikBengtsson · 2020-12-05T00:05:10Z

Ok, let's simplify the discussion so we can get to the same page. You're the first one I've heard arguing for using center = 0. Can we both agree that that is a very special case that only applies to (L2) sample variance estimation? More specifically, can we agree that using:

length(center) == 1 && center != 0

is a mistake and we should produce an error when that is used?

hpages · 2020-12-05T00:35:18Z

I don't see this as a mistake. If that's what the user wants to do then so be it.

The only thing that matters is that the user knows what the function is doing with center. Right now they don't, which again, is at the root of the problem.

So to me this is a very simple situation where you just need to say that rowVars(x, center) does:

sum((x[i, ] - center[i])^2) / (length(x[i, ]) - 1)

for each row, and that center will be set to center <- rowMeans(x) by default. That's really all there is to it. Yes, it's all about clarifying the semantic. Then the user can decide what they do with center. I don't think that trying to detect the valid or invalid uses of center is that useful or productive.

HenrikBengtsson · 2020-12-05T00:38:45Z

Mkay... Let's move on. For me to understand your center = 0 case - is:

sum((x - center)^2) / (length(x) - 1)

used to estimate a variance? If not, in what use case does it come up for you?

hpages · 2020-12-05T00:41:10Z

My use case is not important. If I can't use rowVars(x, center=0) for it, I'll use something else. No worries.

Thanks

HenrikBengtsson · 2020-12-05T00:45:50Z

But, it's useful for me to decide what to do here. If rowVars(x, center = 0) is not used to estimate the variance, then it's a clear case for me - whatever the calculation is done should not be done by rowVars() because that exists in order to estimate the variance. Any other use is basically just surfing on side effects from how it's calculated (which is something that was never clearly defined, hence Issue #187).

hpages · 2020-12-05T00:58:42Z

My use case is to support scale() on DelayedMatrix objects: https://github.com/Bioconductor/DelayedArray/blob/aa9a74556f5c36307ceb638eb31199477139fea8/R/DelayedArray-utils.R#L857

Yes, it's fully understood that when center is not set to rowMeans() then what rowVars() returns should not longer be considered the row variances, in the usual sense. More something like "centered row variances". It doesn't really need a name, as long as it's well defined and coincides with the row variances by default.

HenrikBengtsson · 2021-01-20T21:20:03Z

Sorry for the delay here. I didn't consider scale() and that it supports pre-estimated center values, which by the way opens up to the same issues we've had when an incorrect, or a biases center is given in rowVars() and friends.

Looking at scale(..., center), we find that it also does not support scalar center values and it requires that it's of the correct length, e.g.

> X <- matrix(1:6, nrow = 2)
> scale(X, center = 0)
Error in scale.default(X, center = 0) : 
  length of 'center' must equal the number of columns of 'x'

In order to move forward now and later on (e.g. implement support for center in native code), and to minimize the risk for user mistakes, I've decided to move forward and deprecate support for scalar center. Hoping to submit the release asap (= after running 300+ revdep checks).

HenrikBengtsson added this to the Next release milestone Dec 4, 2020

HenrikBengtsson mentioned this issue Dec 4, 2020

API cleanup: Drop argument center? #187

Closed

HenrikBengtsson changed the title ~~rowVars(..., center=0): Should I stay or should I go? (mostly like go)~~ rowVars(..., center=0): Should I stay or should I go? (most likely go) Dec 5, 2020

const-ae mentioned this issue Dec 6, 2020

Argument 'center' package tests Bioconductor/MatrixGenerics#20

Closed

HenrikBengtsson closed this as completed Jan 20, 2021

ycl6 mentioned this issue Feb 2, 2021

Using a scalar value for argument 'center' is now deprecated. SingleR-inc/SingleR#177

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

HenrikBengtsson commented Dec 4, 2020 •

edited

Loading

hpages commented Dec 4, 2020

HenrikBengtsson commented Dec 4, 2020

hpages commented Dec 4, 2020

HenrikBengtsson commented Dec 4, 2020

hpages commented Dec 4, 2020 •

edited

Loading

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020 •

edited

Loading

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020

HenrikBengtsson commented Jan 20, 2021

rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

rowVars(..., center=0): Should I stay or should I go? (most likely go) #193

Comments

HenrikBengtsson commented Dec 4, 2020 • edited Loading

Issue

Benchmarking

hpages commented Dec 4, 2020

HenrikBengtsson commented Dec 4, 2020

hpages commented Dec 4, 2020

HenrikBengtsson commented Dec 4, 2020

hpages commented Dec 4, 2020 • edited Loading

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020 • edited Loading

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020

HenrikBengtsson commented Dec 5, 2020

hpages commented Dec 5, 2020

HenrikBengtsson commented Jan 20, 2021

HenrikBengtsson commented Dec 4, 2020 •

edited

Loading

hpages commented Dec 4, 2020 •

edited

Loading

hpages commented Dec 5, 2020 •

edited

Loading