
[JOSS] Comments #769

Closed · 11 tasks done
rcannood opened this issue May 5, 2022 · 25 comments

@rcannood
Contributor

rcannood commented May 5, 2022

Hi all!

I'm currently reviewing anndata's submission to JOSS (openjournals/joss-reviews#4371). I think the paper is very well written and worthy of publication at JOSS. However, in going over the reviewer checklist provided by JOSS, there are some minor outstanding issues which I cannot check off at the moment.

Below is an overview of all my relatively minor comments. I'm happy to discuss them in this thread or in a separate issue, if need be.

Robrecht


General issues

  • Repository It seems the repository has been renamed from theislab/anndata to scverse/anndata. Please fix this in JOSS.

Functionality

  • Performance "If there are any performance claims of the software, have they been confirmed?"
    There are various claims throughout the paper that anndata takes the scale of scRNA-seq datasets into account in terms of CPU and memory usage. For example:
    "Due to the increasing scale of data, we emphasized efficient operations with low memory and runtime overhead."
    "In particular, AnnData takes great pains to support efficient operations with sparse data."
    I totally agree with the authors on this matter. However, I don't see any benchmarks in the paper or the documentation where anndata is compared to SCE, Seurat and/or loom in terms of memory usage, disk usage and/or execution time.

Documentation

  • Installation instructions I don't see a "clearly-stated list of dependencies" in the documentation, as required by JOSS. I think the installation instructions could be expanded upon (e.g. https://scanpy.readthedocs.io/en/stable/installation.html).
  • Example usage "Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)." There are examples on concatenation and the on-disk format, but I don't see any usage examples related to the core API of anndata. (A minimal sketch of such an example follows this list.)
  • Community guidelines The contributing guidelines on readthedocs say "To be expanded."
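
For illustration, a core-API usage example could be as small as the following sketch (toy data; all calls are standard anndata/numpy/pandas API, but the matrix, file name, and annotation values are made up):

    import anndata as ad
    import numpy as np
    import pandas as pd

    # A toy 3-cell-by-2-gene expression matrix with per-cell and per-gene annotations.
    adata = ad.AnnData(
        X=np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 5.0]]),
        obs=pd.DataFrame({"cell_type": ["B", "T", "B"]}, index=["c1", "c2", "c3"]),
        var=pd.DataFrame(index=["gene1", "gene2"]),
    )

    # Subsetting by an annotation returns a lightweight view, not a copy.
    b_cells = adata[adata.obs["cell_type"] == "B"]
    print(b_cells.shape)  # (2, 2)

    # Round-trip through the on-disk format.
    adata.write_h5ad("toy.h5ad")
    adata2 = ad.read_h5ad("toy.h5ad")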

Software paper

  • State of the field The paper mentions "... where advances in single-cell RNA sequencing (scRNA-seq) gave rise to new classes of analysis problems with a stronger adoption of Python over the traditional R ecosystem." Please provide a citation for this claim. https://www.scrna-tools.org/analysis (Zappia et al. 2018) seems to show the opposite.
  • Quality of writing The text in Figure 1 is too small to read when printed out on paper.
  • Quality of writing anndata is sometimes written as AnnData. I assume a distinction is made between anndata the package and AnnData the object, but in the text AnnData sometimes seems to refer to the object as well. Please clarify.
  • Quality of writing In the sentence "In the context of how (Wickham, 2014) recommends ...", I feel you should use an inline citation - that is @Wickham2014 instead of [@Wickham2014].
  • State of the field "For analysis of scRNA-seq data in R this has been further simplified by anndata2ri, which allows conversion to SingleCellExperiment and Seurat’s data format." I think it's also worth mentioning the 'anndata for R' package on CRAN, which provides a reticulate wrapper around the anndata Python package. I believe this package offers a more intuitive interface for making small modifications to anndata files, and for people familiar with anndata but not with SCE or Seurat.
  • State of the field In case you want additional examples of anndata's application domain, or think mentioning the following would improve the manuscript: anndata is heavily used in OpenProblems, a community-driven platform for benchmarking tasks in single-cell analysis (openproblems.bio). For example, a NeurIPS 2021 competition (Luecken et al. 2021, https://openreview.net/forum?id=gN35BGa1Rt) used it as the common data format throughout the whole benchmarking pipeline.
@falexwolf
Member

Thank you very much for the review, @rcannood! We'll get back to you!

@falexwolf
Member

I'm sorry it's taking us so long.

We need a bit more time to compile the benchmarks we have. We'll get back in a week.

@rcannood
Contributor Author

Hi Alex et al.! How are things going? Let me know if anything is unclear or a misinterpretation from my end. I'd be happy to discuss! :)

@falexwolf
Member

falexwolf commented Jun 2, 2022

Robrecht!

Apologies for the slow response!

Thank you very much for the in-depth review. It definitely helped us improve the paper!

General issues

  • We fixed the repository link! d409c48

Functionality → performance → benchmarks

  • It makes a comparison with the loom on-disk format here: https://anndata.readthedocs.io/en/latest/benchmark-read-write.html
  • It links to https://github.com/ivirshup/anndata-benchmarks, an entire repository devoted to benchmarking anndata over its lifetime using airspeed velocity.
  • We will add more comprehensive comparative benchmarks in the docs, link them from the landing page, and maintain them.

Documentation

  • Installation
    • We list the dependencies in pyproject.toml:

      anndata/pyproject.toml

      Lines 43 to 53 in a3af650

      dependencies = [
          "pandas>=1.1.1",  # pandas <1.1.1 has pandas/issues/35446
          "numpy>=1.16.5",  # required by pandas 1.x
          "scipy>1.4",
          "h5py>=3",
          "natsort",
          "packaging>=20",
          # for getting the stable version
          "importlib_metadata>=0.7; python_version < '3.8'",
          "typing_extensions; python_version < '3.8'",  # Remove once we depend on python > 3.7
      ]
    • We agree with you that a more comprehensive installation page in the docs could be beneficial! We will add this page similar to what we set up for Scanpy.
  • Example usage
  • Community guidelines

Software paper

  • State of field
    • We realize how the current statement about the adoption of Python was confusing to read! Thank you for pointing this out!
    • We clarified it and explicitly referenced the following Figure 2a from Zappia21, which reviews the same database (scrna-tools.org) that you're citing! c0a2d97

[embedded image: Figure 2a from Zappia et al. 2021]

  • Quality of writing: figure text in figure 1
    • Agreed on this! We will remake the figure. Do you think the figure would be acceptable if we increased the font size of “obs_names” and “var_names”? Or do you think we have to increase the font size everywhere?
  • Quality of writing: anndata is sometimes written as AnnData.
    • Indeed, AnnData denotes the class (data structure), whereas anndata denotes the software package (Python module).
    • We clarified this in the manuscript! ed59ed0
  • Quality of writing: inline citation
  • State of the field:
    • You're absolutely right! Apologies for the omission. We added a reference to the CRAN anndata package! 67741a1
    • We now also mention zellkonverter: 93ce15a
  • State of the field: examples
    • We added a reference to the benchmarking paper! b4684e3

All changes discussed in this response can be viewed together in this PR: #779. The up-to-date version of the paper is in the paper branch #666, as required by JOSS.

We'd be grateful for guidance on the question regarding Figure 1! We're also happy to further adapt the paper to your suggestions!

@rcannood
Contributor Author

rcannood commented Jun 7, 2022

Thanks for your hard work @falexwolf and colleagues! I ticked off the check boxes for which I have no further questions or comments. I also updated the progress of my review checklist at openjournals/joss-reviews#4371 (comment) accordingly :)

My responses to your comments:

  • General issues → Repository

    We fixed the repository link! d409c48

    The first issue at [REVIEW]: anndata: Access and store annotated data matrices openjournals/joss-reviews#4371 still lists github.com/theislab/anndata instead of scverse/anndata. I think it's important that this is fixed, because the url in the JOSS article itself also still points to the theislab organisation.

  • Functionality → Performance

    It makes a comparison with the loom on-disk format here: https://anndata.readthedocs.io/en/latest/benchmark-read-write.html

    The read/write benchmark is a good start! This only covers a very specific aspect of AnnData and only compares to loom. I'm not sure I'm wholly satisfied that this is sufficient to substantiate your claims regarding performance. The manuscript states: "Due to the increasing scale of data, we emphasized efficient operations with low memory and runtime overhead."

    • "Increasing scale of data" → the dataset used in this benchmark only contains 2300 cells
    • "efficient operations" → the same paragraph lists these operations:
      • out of core conversions between dense and sparse data
      • lazy subsetting
      • per-element operations for low total memory usage
      • in-place subsetting
      • combining AnnData objects with various merge strategies
      • lazy concatenation
      • batching
      • backed out-of-memory mode
    • "with low memory [...] overhead" → the benchmark currently doesn't cover memory overhead.
    • "low [...] runtime overhead" → the current benchmark compares against loom to demonstrate runtime efficiency. I wonder, how does it stack up against SCE when performing similar operations?
  • Functionality → Performance

    It links to https://github.com/ivirshup/anndata-benchmarks, an entire repository devoted to benchmarking anndata over its lifetime using airspeed velocity.

    It's very cool that you're able to track the performance of anndata in this way! However, I'm having a hard time browsing the results from this workflow. Are the results of the ASV benchmarks the ones that are published on readthedocs?

  • Functionality → Performance

    We will add more comprehensive comparative benchmarks in the docs, link them from the landing page, and maintain them.

    Let me know when I can take a look at the extended benchmarks. I realise that the above-mentioned comments regarding the benchmarks are quite extensive. Maybe we can ask @luizirber to chime in on the extent to which each performance-related claim in the paper needs to be backed by benchmarks?

  • Documentation → Installation

    We agree with you that a more comprehensive installation page in the docs could be beneficial! We will add this page similar to what we set up for Scanpy.

    Sounds good! Let me know when you'd like me to take a look at that.

  • Quality of writing → figure text in figure 1

    Agreed on this! We will remake the figure. Would you think the figure is acceptable if we increase the fontsize of “obs_names” and “var_names”? Or do you think we have to increase the fontsize everywhere?

    Increasing the size of just the obs_names, var_names should be sufficient :)
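
For concreteness, the kind of operations I have in mind can be sketched with the public anndata API (a sketch assuming a current anndata release; the file name and obs column are made up):

    import anndata as ad

    # Backed out-of-memory mode: open the file without loading X into memory.
    adata = ad.read_h5ad("atlas.h5ad", backed="r")

    # Lazy subsetting: indexing returns a view; no expression data is copied yet.
    subset = adata[adata.obs["cell_type"] == "B cell"]

    # Materialize only the subset in memory.
    small = subset.to_memory()

    # Combining AnnData objects; merge="same" keeps only annotations equal across inputs.
    combined = ad.concat([small, small], merge="same")

Measuring the runtime and memory overhead of operations like these against SCE would substantiate the claims quantitatively.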

@falexwolf
Member

Thanks a lot for this second round, @rcannood! (Why am I so slow in responding? Sorry!)

I pinged the editor re the repo location and am talking to @ivirshup about the rest!

@ivirshup
Member

Hey @rcannood, sorry, I also missed the update here!

"Increasing scale of data" → the dataset used in this benchmark only contains 2300 cells

Looking into making something more up to date here!

It's very cool that you're able to track the performance of anndata in this way! However, I'm having a hard time browsing the results from this workflow. Are the results of the ASV benchmarks the one that are published in readthedocs?

They are not.

At the moment, we just treat this as tooling for development, and don't publish results. We need to get dedicated hardware set up to have meaningful historical results (since performance can be quite variable between machines otherwise).

Additionally, the benchmarks here can be quite specific, and I'm not sure they would be informative by themselves. Here is an example of output: https://www.astropy.org/astropy-benchmarks/
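
For context, asv discovers classes whose methods are prefixed time_ or peakmem_ and tracks their results across commits. A hypothetical benchmark could look like the sketch below (illustrative only, not the actual contents of ivirshup/anndata-benchmarks):

    import anndata as ad
    import numpy as np
    import scipy.sparse as sp

    class SubsetSuite:
        # asv calls setup() before each benchmark run.
        def setup(self):
            X = sp.random(10_000, 2_000, density=0.01, format="csr")
            self.adata = ad.AnnData(X)
            self.idx = np.random.choice(10_000, 1_000, replace=False)

        def time_subset_obs(self):
            # Wall-clock time of materializing a 1,000-cell subset.
            self.adata[self.idx].copy()

        def peakmem_subset_obs(self):
            # Peak memory of the same operation.
            self.adata[self.idx].copy()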

@nlhepler

nlhepler commented Jun 21, 2022

Hello all!

Thank you for the very well-written paper and similarly well-built and well-maintained anndata package. I concur with @rcannood that it is worthy of publication, after a few things have been addressed:

  • It's already been noted, but the package has since moved to scverse/anndata. I presume this will get fixed in time for publication, but it's also not necessarily blocking as an auto-redirect is in place.
  • I'm not sure if it's because the figures possibly appeared first in poster form, but the figure text is far too small for the 8.5x11" format. In particular Figure 1 and Figure 3 need to have the text resized to be at least 10pt (including/especially the TeX bits). Figure 2 could also use some massaging, but its text is not nearly as egregiously tiny.
  • The data structure section is missing a description/enumeration of the layers functionality of the format. This is described in the Figure 1 caption but absent from the main text.
  • The same section is also missing a description of uns, which is also present in Figure 1's caption. It is possibly more important to describe than layers as you refer to uns on line 132 of the main text.
  • I have a question: why are single- and multi-dimensional annotations separated into var and obs vs varm and obsm respectively? A single-dimensional annotation has an implicit dimension of 1 and can leverage the same datastructure as annotations with dimension >1? It may be worth describing the thought process behind this decision in the paper.
  • The data analysis workflow section, line 74: "kept track off by adding"
  • The on-disk format section, line 108: it would be clearer to say "HDF5 AnnData format" instead of .h5ad, or to parenthetically define .h5ad and .zarr on line 101.
  • Another question: line 108: you say the schema is versioned and stored in an internal registry. Is this registry internal to every file, as in a self-describing format, or internal to the anndata project? Either answer is acceptable, it may help to clarify in the paper.
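
To make that question concrete: if the registry is internal to every file (i.e. the format is self-describing), one should be able to read version tags with plain h5py, along these lines (a sketch; I'm assuming attribute names like encoding-type / encoding-version, which recent anndata versions appear to write):

    import h5py

    with h5py.File("example.h5ad", "r") as f:
        def show(name, obj):
            # Print the schema tag of each element that carries one.
            enc = obj.attrs.get("encoding-type")
            if enc is not None:
                print(name, enc, obj.attrs.get("encoding-version"))
        f.visititems(show)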

A couple remarks I think warrant some interaction/debate:

  • I think it is maybe inaccurate to imply AnnData is a datastructure library (The AnnData object section). It seems to leverage existing datastructure libraries (for integration purposes, like numpy, scipy.sparse, and pandas), and provides a layer of APIs on top of these datastructures that render experimental data of this form easy to programmatically manipulate, integrate with downstream analyses, and serialize to and from disk.
  • Related to the above, Between the Introduction and the first sentence of The AnnData object section there's an indirect statement that AnnData is fulfilling an important niche: providing ExpressionSet-like functionality for Python language users, and integrating with the relevant packages in the Python ecosystem to provide access to e.g. modern machine learning and other (scientific) software unique to the Python ecosystem. It may be better to directly state this.

I also have a few other comments that I would like to share that (I do not think block publication but) might be helpful if you would like to further polish this submission or if you intend to submit a follow-up paper in the future:

  • AnnData is relatively mature, having developed originally as part of the scanpy project. As such, I think the Outlook section (or another) could contain a discussion of lessons learned and/or resolved, in-scope for future development efforts, or out-of-scope but something to think about in any rewrite/successor project. Comparisons with prior versions could also be a source of benchmarks.
  • Most papers describing new software would have benchmarks showcasing absolute performance claims. The paper as-is describes its performance qualitatively, rather than in quantitative terms, which is less useful/informative, but does indicate that the authors are aware how important time/space efficiency are to practical use of the project. Unfortunately, AnnData does not have any serious competitors in the Python ecosystem, and the R-based competitors are ... well it can often be a totally different set of trade-offs. So "advancing state of the art" here would be admittedly difficult to showcase. Instead it might be more appropriate to quantify the amount of boilerplate saved by using AnnData over rolling one's own data loaders/accessors, and to extol more assertively the benefits of standardized on-disk formats.
  • Related to the above, I would appreciate any future papers on AnnData to engage in self-comparative benchmarks to showcase the advantages of any new features/APIs/developments/etc.
  • As someone who has worked on similar software in a corporate setting for highly efficient storage and high-performance analysis of large datasets (100K+ cells) on commodity PC hardware, I know how important it is to provide low-level APIs that can be leveraged to write highly efficient applications. This is especially true in the context of lazy subsetting where you really do not want to fully instantiate a sub-matrix and possibly double your memory usage and do lots of unnecessary copying. I would really appreciate a detailed treatment of any such APIs in this or any future paper. In fact, I would argue that such work is what has made numpy such an important and successful project, as low-level high-performance APIs made it feasible to build many downstream packages (e.g. scipy, sklearn, etc) and integrations (e.g. Cython).

Thank you all again for the opportunity to review this paper and render remarks!

Best,

Lance Hepler

@falexwolf
Member

Thank you for this in-depth review! It provides very valuable feedback!

We'll thoroughly respond, but it's likely gonna take a little time.

@rcannood
Contributor Author

Hi @falexwolf ! Any updates on this issue? (I feel Lance should have created a separate issue, so the topics could be tackled separately instead of being intermingled, but ok.) Many of my comments are relatively minor, so it'd be a pity for them to hold up the submission of AnnData to JOSS.

@falexwolf
Member

Hi @rcannood, @ivirshup and I made a list and we worked out many of them (e.g., benchmarking). We'll definitely get back.

I'm sorry that, on my end, I've gotten much busier with @laminlabs in the past months, but we'll definitely not leave this unfinished.

I'll ping Isaac to have another meeting to finalize this.

@falexwolf
Member

falexwolf commented Oct 2, 2022

@ivirshup & I are scheduled to meet today for the first time after 2 months. We split up the work. I'll add my responses here for now and very much hope we can finalize with another comment in this thread later today.

I'll address Lance's points in the same order and grouping as he provided them.

Regarding the text:

  1. "repo location": Fixed in all places where we could.
  2. "figure text": @ivirshup, next comment.
  3. "layers": 1516022
  4. "uns": e9a5b02
  5. "Why are one- and multi-dimensional annotations separated in two fields?": Two reasons are
    a. Usage patterns: One-dimensional annotations are used differently than multi-dimensional annotations. They always serve as labels with some semantic content (even if numeric), while multi-dimensional annotations store representations of the data that need additional algorithms for interpretation (assigning semantic content). Having two slots that each group similarly used things seemed like good UX (making a few canonical things easy, like "benchmark these alternative latent representations", "combine these labels into one", "fit a model on these conditions").
    b. One-dimensional annotations have a simple data type and a canonical storage format for grouping them (a DataFrame). But back in 2017, even this couldn't easily be stored persistently in an HDF5 file. So, when writing the on-disk format, I had to write a persistent serialization of a DataFrame and of multi-dimensional annotations. Both were feasible without too much work when done separately, as each of these had simple data types, but I wouldn't have wanted to mix the two (allowing for multi-dimensional annotations with mixed data types). @ivirshup implemented the latter years later. (A small sketch of the resulting split follows this list.)
  6. Typo: 24895a8
  7. "h5ad": 66962c5
  8. "schema tracking": @ivirshup, next comment.

Debate:

  1. "I think it is maybe inaccurate to imply AnnData is a data structure library (The AnnData object section).": @ivirshup & I discussed this remark in early August and thought we'd prefer to keep the term "data structure". While we agree that AnnData is based on existing data structures, it composes these to define a new data structure. We think this is a bit similar to how pandas stores data internally in numpy arrays, but composes them to give rise to a much richer structure, the DataFrame. Like pandas and other "data structure libraries", anndata offers APIs to conveniently interact with the new data structure.
  2. "Clearer statement about filling a gap". We agree: 4bbf466

Outlook:

  1. "Lessons learned & benchmarks". @ivirshup will respond re benchmarking as he has spent much over the past years. I'm not sure whether there are very specific "lessons learned" for anndata. Clearly, today, a lot could be done differently. But back then, certain requirements made the adoption of say xarray hard to impossible. Isaac may have more things to say.
  2. "Benchmarking": @ivirshup will address this.
  3. "Self-comparative benchmarks": We agree!
  4. "Low-level API": @ivirshup will address this.

Let us follow up with another post to address all outstanding remarks.

All changes discussed in this response can be viewed together in this PR: #825. They follow on the changes made in response to @rcannood (#779 / #769 (comment)).

We'll also address @rcannood's outstanding points beyond #769 (comment).

The up-to-date version of the paper is in the paper branch #666, as required by JOSS.

@nlhepler

nlhepler commented Oct 11, 2022

Thank you for the responses! A few remaining notes, including one nitpick.

Text:

  1. I presume I'm waiting on @ivirshup?
  2. "which allows to store" is awkward phrasing --> "which allows one to store" or "which allows storing", etc.
  3. Thank you for this explanation! It is probably too much detail to put into the main text, I wish there was a way to include supplemental information like this.

Debate:

  1. I will concede. It is likely more sensible than any other description.

@rcannood
Contributor Author

@nlhepler Next time, please create a separate issue for your own comments. It's becoming hard to track what still needs to be done :)

@rcannood
Contributor Author

To recap, I think my remaining issues are:

General issues

  • Repository It seems the repository has been renamed from theislab/anndata to scverse/anndata. Please fix this in JOSS.

Functionality

  • Performance "If there are any performance claims of the software, have they been confirmed?"
    Would it be possible to include an atlas-sized dataset (e.g. >100'000 cells) in the benchmarks included on the website / somewhere?

Documentation

Let's make one more push towards getting AnnData published in JOSS. Let's go @falexwolf @ivirshup !

@falexwolf
Member

falexwolf commented Oct 21, 2022

I'm fully on board, @rcannood - I think the remaining points are what you point out, plus the font sizes of the figures. The repo location can't be fixed within the paper itself, as there is no metadata section. So, I'll clarify that with the editors. Unfortunately, I can't do the other things independently but need @ivirshup for this.

@ivirshup, let's go! We've formally worked on this for more than 2 years (https://github.com/ivirshup/anndata-paper/graphs/contributors) and had prepared it for even longer. 😅 It'd be a shame if we didn't finish it after so much time.

@falexwolf
Member

@ivirshup - Trying to ping you here to get this back on your radar. 😅 Let me know if I can help in any way!

@rcannood
Contributor Author

I met up with @ivirshup during the scverse hackathon. He mentioned that there are benchmarks on ±100'000 cells performed in CI. I'm eager to see the results :)

@falexwolf
Member

Great that you caught @ivirshup, @rcannood! I'm eager to see the results, too!

@rcannood
Contributor Author

rcannood commented May 31, 2023

Hi @ivirshup! Do you have any updates on the availability of the benchmarks on decently sized datasets?

@rcannood
Contributor Author

rcannood commented Jun 7, 2023

With the AnnData JOSS submission now past its one-year anniversary, I'd really like to have it wrapped up and dealt with. AnnData is a great resource for researchers all over the world, and I think it'd be a pity if this work didn't get published at JOSS.

@falexwolf The paper mentions:

Furthermore, anndata is systematically benchmarked for performance using airspeed velocity [@Droettboom13], with the results linked from the docs.

@falexwolf Can you remind me where these results are published? The Benchmarks page only links to a simple benchmark and the ivirshup/anndata-benchmarks repository, but I don't see any actual results. Could you help me find them?

@falexwolf
Member

All benchmarking referenced here is by @ivirshup; I'm not in the loop on the details. @ivirshup, could you help out?

@github-actions github-actions bot added the stale label Aug 7, 2023
@flying-sheep flying-sheep added pinned and removed stale labels Aug 7, 2023
@scverse scverse deleted a comment from github-actions bot Aug 7, 2023
@falexwolf
Member

Downtuned performance claims as requested by the editor: #1267

See discussion in the JOSS review: openjournals/joss-reviews#4371

@falexwolf
Member

This completes a sequence of 3 PRs to the paper branch.

For reference, the paper branch is here:

@rcannood
Contributor Author

Thanks @falexwolf! This resolves all of my outstanding comments :) Closing this issue.
