feedback about JSS/arxiv paper #620
Comments
@tdhock thanks a lot for your feedback and these comparisons. I have gotten some similar comments from JSS reviewers regarding the benchmark section, i.e., that it should accommodate different data sizes. Overall, though, this section should not become too large, as collapse is already part of two fair benchmarks. Thanks also for alerting me to your paper! I just read through your example. Very interesting. Though I have to note that the fastest way to do aggregation pivots with collapse is to use the hard-coded internal functions ("mean" in quotes) instead of mean. I will see if such detailed exposition can make it into the article.
Thanks for the info about the internal functions. By the way, I am leading a project about expanding the group of contributors to R data.table, and I would like to invite you to participate. Even if you don't have any knowledge of the internal workings of data.table, I believe your C coding experience and your focus on efficiency would be a valuable asset to the project. Also, there is travel funding available, which could pay for you to give a talk related to data.table at any relevant meeting (R/stats-related conference, meetup, etc.). For more info, please see the application guidelines: https://rdatatable-community.github.io/The-Raft/posts/2023-11-01-travel_grant_announcement-community_team/
@tdhock thanks. 2x? The internal function can be up to 50 times faster, depending on the problem size.

```r
options(fastverse.styling = FALSE)
library(fastverse)
#> -- Attaching packages --------------------------------------- fastverse 0.3.3 --
#> Warning: package 'data.table' was built under R version 4.3.1
#> v data.table 1.15.4 v kit 0.0.17
#> v magrittr 2.0.3 v collapse 2.0.16
DT <- data.table(id = sample.int(1e6, 5e7, TRUE),
variable = sample(letters, 5e7, TRUE),
value = rnorm(5e7))
system.time(pivot(DT, how = "w", FUN = mean)) # Split-apply-combine using base::mean
#> user system elapsed
#> 83.900 10.430 98.158
system.time(pivot(DT, how = "w", FUN = fmean)) # Vectorized across groups with fmean + deep copy
#> user system elapsed
#> 3.843 0.773 5.061
system.time(pivot(DT, how = "w", FUN = "mean")) # Internal: on the fly (1-pass, running mean)
#> user system elapsed
#> 2.089 0.121 2.227
system.time(pivot(DT, how = "w", FUN = "mean")) # To confirm
#> user system elapsed
#> 2.049 0.117 2.220
```

Created on 2024-09-03 with reprex v2.0.2

A main difference between collapse and data.table is that vectorizations in collapse are always explicit, i.e., there is no internal GForce mechanism that substitutes base R functions with optimized ones; instead, collapse offers explicit fast statistical functions and, in the case of pivot, hard-coded internal functions passed as strings.

And thanks a lot for the invite to contribute to data.table. I would say I have a fair understanding of parts of the data.table source code and could of course get engaged, but I don't think I'll be able to afford a lot of time there, as collapse is also a very large software project and I have many other commitments.
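To make the contrast concrete, here is a minimal hedged sketch (with made-up toy data, not from this thread) of the two styles: data.table silently rewrites mean() inside [] into its optimized group-wise version via GForce, whereas collapse asks you to call the vectorized group-wise function explicitly.

```r
# Illustrative sketch only -- toy data, not part of the original benchmark
library(data.table)
library(collapse)
DT <- data.table(id = sample.int(1e5, 1e6, TRUE), value = rnorm(1e6))

# data.table: mean() in the j expression is internally replaced by an
# optimized group-wise mean (the GForce mechanism) -- the speedup is implicit.
DT[, mean(value), by = id]

# collapse: the vectorized group-wise function is called explicitly.
fmean(DT$value, g = DT$id)
```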
Thanks @tdhock, also for showcasing your work. And just as a side note regarding CRAN dependencies: I think the community, and particularly package developers, would do very well to also give some more attention to the fastverse project - to which I am also happy to invite some of you if interested (and I am always happy to feature new packages). Tabular data is not the most efficient structure for many statistical operations - one reason I have opted to create a class-agnostic statistical computing architecture in collapse that also supports vectors and matrices (a short sketch follows below). There are many other smaller and lightweight packages that are really efficient at doing certain things. In my own smaller packages, such as osmclass and dfms, I use data.table and collapse (and others), but not for the tabular stuff. For example, from data.table I find
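As a rough illustration of that class-agnostic design (a hedged sketch with illustrative data, not taken from the paper), the same fast statistical function accepts vectors, matrices, and data frames, with optional grouping:

```r
# Minimal sketch of collapse's class-agnostic interface (illustrative data)
library(collapse)
x <- rnorm(10)                     # plain vector
m <- matrix(rnorm(20), ncol = 2)   # matrix
g <- rep(1:2, each = 5)            # grouping vector of the same length as x

fmean(x)                       # mean of a vector
fmean(x, g = g)                # group means of a vector
fmean(m)                       # column means of a matrix
fmean(mtcars, g = mtcars$cyl)  # group means of every column of a data frame
```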
To give more attention to the fastverse/collapse project, you may consider submitting it for the Seal of Approval: https://github.com/Rdatatable/data.table/blob/master/Seal_of_Approval.md
Hi @SebKrantz, I read your paper about collapse (https://arxiv.org/pdf/2403.05038), and here are some comments/questions that are meant to help improve the paper (as in peer review).
Section 9 (Benchmark) presents comparative timings, each for a single data size:

```r
m <- matrix(rnorm(1e7), ncol = 1000)
```
While this approach is simple to implement, it has two drawbacks: (1) it is difficult or impossible for others to reproduce the exact timings on other CPUs, and (2) it may hide important details about the asymptotic properties of the approach (asymptotically: when N is very large, does the code take time that is linear, log-linear, or quadratic in N?).
I think it would be much more convincing if you did an asymptotic analysis, plotting time versus data size N, which resolves both issues.
You could use my atime package, which makes it easy to do such analyses; see for example https://tdhock.github.io/blog/2024/collapse-reshape/
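For concreteness, here is a rough sketch of what such an atime analysis could look like for the wide pivot discussed above (the sizes, expressions, and seconds.limit value are illustrative assumptions, not the exact benchmark from the blog post):

```r
# Hedged sketch: time and memory versus data size N for three wide-pivot methods
atime.res <- atime::atime(
  N = as.integer(10^seq(3, 7, by = 0.5)),
  setup = {
    DT <- data.table::data.table(
      id       = sample.int(N, N, TRUE),
      variable = sample(letters, N, TRUE),
      value    = rnorm(N))
  },
  seconds.limit = 1,
  "collapse::pivot"    = collapse::pivot(DT, how = "w", FUN = "mean"),
  "data.table::dcast"  = data.table::dcast(DT, id ~ variable,
                                           value.var = "value",
                                           fun.aggregate = mean),
  "tidyr::pivot_wider" = tidyr::pivot_wider(DT, id_cols = id,
                                            names_from = variable,
                                            values_from = value,
                                            values_fn = mean)
)
plot(atime.res)                        # kilobytes and seconds versus N
refs <- atime::references_best(atime.res)
plot(refs)                             # adds O(N), O(N log N), ... reference lines
```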
The first figure in that post shows that collapse is faster than the alternatives in terms of absolute computation time, but it also shows a difference in slope between the methods, which implies a different complexity class; this is estimated in a second figure.
That second figure shows that, for the top panels (kilobytes), the black empirical timing lines are aligned with the violet N reference lines, suggesting O(N) memory complexity. For the bottom panels (seconds), the references suggest linear time, O(N), for data.table and log-linear time, O(N log N), for the others, which is a more complete description of the different approaches. Moreover, identifying these asymptotic complexity class differences suggests a clear avenue for improving the performance of the code (change the underlying algorithm to avoid a sort).

Also, I see that your paper discusses reshaping in section 5.2 (Pivots) but does not cite my nc paper, which discusses similar approaches to wide-to-long data reshaping: https://journal.r-project.org/archive/2021/RJ-2021-029/index.html. Its Table 1 shows details about the features implemented in different reshaping functions.
I think your paper would benefit from adding an analogous table, which would highlight the new features you propose in collapse and explain which of those features are present or absent in previous work.
For example, one of the advantages that I see in collapse is a single function (pivot) for both longer and wider operations (as well as recast). It would be great to see "recast" as one of the features in your Table 1, where you could highlight that it is present in collapse but not in tidyr/data.table. Also, I believe you should add a citation and comparison with cdata, which is another package implementing recast; see https://cloud.r-project.org/web/packages/cdata/vignettes/general_transform.html and other vignettes.
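For illustration, here is a minimal sketch of that single-function interface (the toy data and argument choices are my own assumptions; see ?collapse::pivot for the exact semantics, including how = "recast"):

```r
# Hedged sketch of collapse::pivot covering several reshaping directions
library(collapse)
d <- data.frame(id       = rep(1:3, each = 2),
                variable = rep(c("a", "b"), 3),
                value    = rnorm(6))

wide <- pivot(d, how = "w")                 # long -> wide, as in the reprex above
long <- pivot(wide, ids = "id", how = "l")  # wide -> long
# how = "r" ("recast") reshapes directly from one layout to another in a single
# call -- the feature suggested above for inclusion in a comparison table.
```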