-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rowVars(..., center=0): Should I stay or should I go? (most likely go) #193
Comments
I'll just reiterate that this is really at the center of this entire discussion. Again, all the problems, misunderstandings, frustration, and sadness around the current handling of |
Hmm... that's just a difference in performance - I'm after the behavior here. |
Don't know how it is now but last time I checked (a couple of weeks ago) it was both performance and behavior. The 2 paths were not using the same formula. My point is that if |
But only if the native code had implemented the primary formula and not the alternative one. My point here is, let's not look what's going on inside but what the API should provide and what role |
But once the inside is fixed, the behavior becomes healthy and there is no need a priori to try to artificially block the use of arbitrary |
Ok, let's simplify the discussion so we can get to the same page. You're the first one I've heard arguing for using
is a mistake and we should produce an error when that is used? |
I don't see this as a mistake. If that's what the user wants to do then so be it. The only thing that matters is that the user knows what the function is doing with So to me this is a very simple situation where you just need to say that
for each row, and that |
Mkay... Let's move on. For me to understand your
used to estimate a variance? If not, in what use case does it come up for you? |
My use case is not important. If I can't use Thanks |
But, it's useful for me to decide what to do here. If |
My use case is to support Yes, it's fully understood that when |
Sorry for the delay here. I didn't consider Looking at > X <- matrix(1:6, nrow = 2)
> scale(X, center = 0)
Error in scale.default(X, center = 0) :
length of 'center' must equal the number of columns of 'x' In order to move forward now and later on (e.g. implement support for |
In const-ae/sparseMatrixStats#13 (comment) @hpages argues for:
to stay with the rationale that it provides an efficent way to calculate:
BTW, the latter can be improved further; see below for details.
Issue
In the next release, I've decided to tighten up what
center
can take, specifically, it must be of the same length as the number of rows inX
, cf. #187 (comment). Using a scalar, as here, will result in a deprecation warning, e.g.Now, although unclear, the purpose of the argument
center
was always meant to provide a way to avoid having to re-estimate center if already estimated, e.g.Using
rowVars(X, center = 0)
as an efficient version ofis more of a side-effect rather than being the intended usage. Point is, this use of
center
is different from the original use ofcenter
and there's always a risk of mistakes, misunderstandings, and bugs when something has more than one purpose.For the above reasons, unless someone can convince me otherwise, my plan is to go ahead a deprecate scalar values on
center
as iscenter = 0
.BTW, as it stands now, with the next release, using:
will still work but to get rid of the above warning one needs to do:
to achieve the same thing. From the below benchmarking, this approach is ~25% faster compared to the version that does not use a
center
argument.Benchmarking
We have:
but a more efficent implementation of this is:
Now,
rowSums(!is.na(X))
can be calculated as:This avoids the negation
!is.na(X)
of a full matrix;We can also avoid the
is.na(X)
"coercion" by usingrowCounts()
as in:Thus, a more efficent version of the above is:
Comment: When specifying
center
,rowVars()
is not fully using native code, i.e.f0()
would be faster if/when we re-implementcenter
in native code.The text was updated successfully, but these errors were encountered: