
Add function for calculating niches #831

Open · wants to merge 74 commits into base: main
Conversation

@LLehner LLehner commented May 27, 2024

Description

Adds a function that calculates niches using different strategies. The initial function calculates niches based on neighborhood profiles similar to here.

This PR will get updated with methods discussed in #789.
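For readers, a minimal sketch of the neighborhood-profile idea: count the cell-type composition of each cell's spatial neighbors, then cluster the resulting profiles. Names, the dense toy adjacency matrix, and the helper below are illustrative only; the PR's actual implementation and API may differ.

```python
import numpy as np

def neighborhood_profile(adj: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-cell neighbor cell-type composition.

    adj: (n_cells, n_cells) binary adjacency matrix.
    labels: (n_cells,) integer cell-type labels.
    Returns an (n_cells, n_types) row-normalized profile matrix whose rows
    can then be clustered (e.g. with leiden) to obtain niche labels.
    """
    n_types = int(labels.max()) + 1
    one_hot = np.eye(n_types)[labels]        # (n_cells, n_types)
    counts = adj @ one_hot                   # neighbor type counts per cell
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)    # row-normalize, guard empty rows

# toy example: four cells in a row, alternating types 0 and 1
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
labels = np.array([0, 1, 0, 1])
profiles = neighborhood_profile(adj, labels)
```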

@LLehner LLehner marked this pull request as draft May 27, 2024 16:24
@LLehner LLehner requested a review from timtreis May 27, 2024 16:43
@timtreis timtreis added the squidpy2.0 Everything related to a Squidpy 2.0 release label May 29, 2024
@LLehner LLehner marked this pull request as ready for review October 10, 2024 12:44
@LLehner LLehner requested a review from giovp October 10, 2024 12:44
@LLehner (Member, Author) commented Oct 10, 2024

@giovp @timtreis PR is ready for review!

Some additional questions that came up, which you could have a look at:

  1. How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block-diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data, or should it be done on individual slides? I noticed this can change niche results.
  2. I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions); what would you recommend there?
  3. If you run sq.calculate_niche() more than once with flavor="neighborhood", the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster (e.g. you first run the method with one cluster resolution, then call it again with other resolutions). It's almost as if the subsequent runs skip the neighborhood calculation (referring to sc.pp.neighbors here), but that shouldn't be the case: nothing is cached, and calculations happen on a new AnnData object for every function call. The "issue" also persists if you change the count matrix shape (e.g. by masking data).
  4. Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.

@timtreis (Member) commented:
Hey @LLehner

> How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block-diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data, or should it be done on individual slides? I noticed this can change niche results.

You mean a scenario where the user is just storing multiple slides in the same sdata object, like the `point8`, `point16`, `point24` dataset we have? These should be fully independent since they have no biological connection.

> I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?

Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that handle job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.
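A rough shape of that option, using stdlib `concurrent.futures` as a stand-in for whatever parallel helper the project already has; `cluster_at_resolution` and `run_resolutions` are hypothetical names, not squidpy API.

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_at_resolution(resolution: float) -> str:
    # stand-in for the real per-resolution call, e.g.
    # sc.tl.leiden(adata, resolution=resolution, key_added=f"leiden_{resolution}")
    return f"leiden_res{resolution}"

def run_resolutions(resolutions: list[float], n_jobs: int = 1) -> list[str]:
    # n_jobs=1 stays strictly single-threaded so external schedulers
    # (snakemake, SLURM, ...) keep control over core allocation
    if n_jobs == 1:
        return [cluster_at_resolution(r) for r in resolutions]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(cluster_at_resolution, resolutions))
```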

> If you run sq.calculate_niche() more than once with flavor="neighborhood", the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster. It's almost as if the subsequent runs skip the neighborhood calculation (referring to sc.pp.neighbors here), but that shouldn't be the case: nothing is cached, and calculations happen on a new AnnData object for every function call. The "issue" also persists if you change the count matrix shape (e.g. by masking data).

Could it be that the OS has some of the compiled bytecode or data in cache? 🤔
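One way to check: time the calls explicitly and compare a fresh interpreter against repeated calls in the same process. A plausible warm-up cost is numba JIT compilation inside the approximate-neighbor search that `sc.pp.neighbors` uses, since compiled kernels are reused for the lifetime of the process. A tiny timing helper (not part of the PR):

```python
import time

def timed(fn, *args, **kwargs):
    """Wall-clock one call. Comparing the first and later calls within the
    same process against runs in fresh interpreters helps separate real
    work from warm-up costs such as JIT compilation or OS page caches."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```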

> Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.

What runtime are we talking about here? I tend to be a fan of some intermediate output (as long as one can also silence it, e.g. with a verbosity level).

@giovp (Member) left a comment:
That looks great @LLehner, I think it would benefit from a refactoring: take out the single implementations, validate arguments, and then run the functions. We discussed potentially even doing separate modules, e.g.

squidpy.gr.niches.utag
...

but I'm not sure it is necessary; what do you and @timtreis think?
Keeping it like this would also be fine, but then the arguments should be handled a bit better (e.g. check whether the passed arguments fit the single implementation/flavour requested, and otherwise fail right away). Also, how each argument maps to each implementation could be documented in the docs.

Also some work to be done in the docs more generally.

) -> AnnData | pd.DataFrame:
"""Calculate niches (spatial clusters) based on a user-defined method in 'flavor'.
The resulting niche labels will be stored in 'adata.obs'. If flavor = 'all' then all available methods
will be applied and additionally compared using cluster validation scores.
space

%(adata)s
flavor
Method to use for niche calculation. Available options are:
- `{c.NEIGHBORHOOD.s!r}` - cluster the neighborhood profile.
these should be defined in constants and added with inject docs, see other docs on how to do it

- `{c.CELLCHARTER.s!r}` - cluster adjacency matrix with Gaussian Mixture Model (GMM) using CellCharter's approach.
- `{c.SPOT.s!r}` - calculate niches using optimal transport. (coming soon)
- `{c.BANKSY.s!r}` - use the Banksy algorithm. (coming soon)
%(library_key)s
this won't work without injecting docs; see the Read the Docs build to see how it renders

but the function is not added in the docstrings so it is not rendering at all

Restrict niche calculation to a subset of the data.
table_key
Key in `spatialdata.tables` to specify an 'anndata' table. Only necessary if 'sdata' is passed.
mask
what's the use case for this?

Required if flavor == 'neighborhood'.
n_neighbors
Number of neighbors to use for 'scanpy.pp.neighbors' before clustering using leiden algorithm.
Required if flavor == 'neighborhood' or flavor == 'UTAG'.
you can use the constants also here

Comment on lines +134 to +138
if resolutions is not None:
    if not isinstance(resolutions, list):
        resolutions = [resolutions]
else:
    raise ValueError("Please provide resolutions for leiden clustering.")
this could be `resolutions = [resolutions] if not isinstance(resolutions, list) else resolutions`, but then you'd also have to validate the args for this specific niche flavour. Using separate functions, these checks could be done before



def _utag(adata: AnnData, normalize_adj: bool, spatial_connectivity_key: str) -> AnnData:
"""Performs inner product of adjacency matrix and feature matrix,
docs formatting, please do
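For context, the inner-product step being documented here can be sketched as follows: a dense toy version assuming `normalize_adj` row-normalizes the adjacency matrix, not the PR's actual code.

```python
import numpy as np

def utag_smooth(adj: np.ndarray, X: np.ndarray, normalize_adj: bool = True) -> np.ndarray:
    """Inner product of (optionally row-normalized) adjacency and features:
    each cell's features become the mean over itself and its spatial
    neighbors, which is the message-passing step clustered afterwards."""
    A = adj + np.eye(adj.shape[0])               # include self-connections
    if normalize_adj:
        A = A / A.sum(axis=1, keepdims=True)     # row-normalize -> neighborhood mean
    return A @ X

adj = np.array([[0.0, 1.0], [1.0, 0.0]])         # two mutually adjacent cells
X = np.array([[0.0], [2.0]])                     # one feature per cell
smoothed = utag_smooth(adj, X)
```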

"""
...
"""

return float(-(distribution * np.log(distribution + 1e-8)).sum())


def _iter_uid(
please add docstring to this function

Returns
The Shannon entropy
"""
return float(-(distribution * np.log(distribution + 1e-8)).sum())
I think both numpy and scipy implement entropy, why is this required?
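For reference, `scipy.stats.entropy` (natural log by default) matches the hand-rolled expression up to the `1e-8` stabilizer:

```python
import numpy as np
from scipy.stats import entropy

dist = np.array([0.5, 0.25, 0.25])

manual = float(-(dist * np.log(dist + 1e-8)).sum())  # hand-rolled version from the PR
builtin = float(entropy(dist))                       # scipy, natural log by default
```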

return float(np.pad(f1_scores, (0, n_classes - len(f1_scores))).mean())


def jensen_shannon_divergence(adatas: AnnData | list[AnnData], library_key: str, slide_key: str | None = None) -> float:
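A minimal version over plain probability vectors, using `scipy.spatial.distance.jensenshannon`, which returns the *distance* (the square root of the divergence) and so must be squared; the PR's AnnData-based signature would wrap something like this.

```python
from scipy.spatial.distance import jensenshannon

def jsd(p, q) -> float:
    """Jensen-Shannon divergence (base-e) between two niche-label
    distributions; sketch over plain probability vectors only."""
    return float(jensenshannon(p, q) ** 2)
```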
@giovp (Member) commented Oct 28, 2024

also on this question

> How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block-diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data, or should it be done on individual slides? I noticed this can change niche results.

Yes, and yes. If neighbor calculation is run with multiple slides, a block-diagonal matrix of the spatial neighbors is returned, in effect treating the slides as independent.
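A toy illustration of that block-diagonal structure with `scipy.sparse.block_diag` (the slide matrices are made up): computing the graph once over all slides is equivalent to stacking the per-slide graphs, with no cross-slide edges.

```python
import numpy as np
from scipy.sparse import block_diag

# toy per-slide spatial adjacency matrices
slide_a = np.array([[0, 1],
                    [1, 0]])
slide_b = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])

# building the graph once over the concatenated slides should match
# stacking the per-slide graphs block-diagonally:
joint = block_diag([slide_a, slide_b]).toarray()
```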

> I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?

> Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that handle job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.

Yes, agreed! You can take a look at the `Parallel` module we have now and just reuse it.

> Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.

absolutely please log everything you see fit!

Labels: graph 🕸️, squidpy2.0
5 participants