Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cocycles feature #23

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

MonkeyBreaker
Copy link
Collaborator

@MonkeyBreaker MonkeyBreaker commented Aug 16, 2021

This PR allows users to retrieve cocycles from the Persistent Homology computation.

  • : C++ backend compute cocycles during computation
  • : Update bindings to enable retrieving cocycles
  • : Update Python interface to support retrieving cocycles with argument do_cocycles
    • : Should we use a different name than do_cocycles ?
  • : Add test for verify that the feature works as expected
    • : Add test when cocycles is produced by a persistence pair
    • : Add test when cocycles is produced by an essential pair
  • : Add a notebook with an example of using cocycles with maybe some interaction with a plot for a nice visualization.
  • : Update documentation about this new feature

@MonkeyBreaker MonkeyBreaker added the enhancement New feature or request label Aug 16, 2021
Copy link
Collaborator

@ulupo ulupo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first quick look at ripser_bindings.cpp

gph/bindings/ripser_bindings.cpp Outdated Show resolved Hide resolved
gph/bindings/ripser_bindings.cpp Outdated Show resolved Hide resolved
gph/bindings/ripser_bindings.cpp Outdated Show resolved Hide resolved
gph/bindings/ripser_bindings.cpp Outdated Show resolved Hide resolved
@ulupo
Copy link
Collaborator

ulupo commented Aug 16, 2021

Should we use a different name than do_cocycles ?

Even if longer, I would slightly prefer return_cocycles. Reason: one can "do" a lot of computation without returning anything.

@MonkeyBreaker
Copy link
Collaborator Author

MonkeyBreaker commented Aug 16, 2021

Should we use a different name than do_cocycles ?

Even if longer, I would slightly prefer return_cocycles. Reason: one can "do" a lot of computation without returning anything.

Sure, should it be an argument of function ripser_parallel or should we pass it via kwargs ?
Lets go with return_cocycles, sorry for the noise

@ulupo
Copy link
Collaborator

ulupo commented Aug 16, 2021

Sure, should it be an argument of function ripser_parallel or should we pass it via kwargs ?

Are you asking if it should be positional or keyword-based with a default? I would strongly support keyword-based with a default of False, as you have already done.

gph/python/ripser_interface.py Outdated Show resolved Hide resolved
gph/python/ripser_interface.py Outdated Show resolved Hide resolved
gph/python/ripser_interface.py Outdated Show resolved Hide resolved
gph/src/ripser.h Outdated Show resolved Hide resolved
gph/test/test_ripser.py Outdated Show resolved Hide resolved
@CLAassistant
Copy link

CLAassistant commented Jan 5, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ ulupo
❌ MonkeyBreaker
You have signed the CLA already but the status is still pending? Let us recheck it.

julian and others added 15 commits August 17, 2022 16:49
* Enable feature in C++ backend
* Cannot enable the computation in Python right now
* A lot of changes happened in the bindings, but in reality is not much, I forgot to run previously clang-format

Signed-off-by: julian <[email protected]>
Co-authored-by: Umberto Lupo <[email protected]>
This allows that finite cocycles are sorted with the corresponding finite barcode

Signed-off-by: julian <[email protected]>
Remove check for first insertion for pivot_to_death_idx hash table

Signed-off-by: julian <[email protected]>
Signed-off-by: julian <[email protected]>
@MonkeyBreaker
Copy link
Collaborator Author

@ulupo updated !

Comment on lines +47 to +48
using namespace pybind11::literals;
m.doc() = "Ripser python interface";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this indentation intentionally 2 spaces instead of 4?

.def_readwrite("flag_persistence_generators_by_dim",
&ripserResults::flag_persistence_generators);

m.def(
"rips_dm",
[](py::array_t<value_t>& D, py::array_t<value_t>& diag, int modulus,
int dim_max, float threshold, int num_threads,
bool return_generators) {
bool return_generators, const bool do_cocycles) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation: 3 spaces?

@michiexile
Copy link

Hey everyone!

I'm pretty sure I'll end up being able to provide a nice notebook that showcases the cocycles feature. My first attempt is in the attached ipynb, where the resulting circular coordinate ends up not being great (it should visualize with all the colors of the rainbow...)
I have not had time yet to figure out where things go wrong in computing here. The Python ripser version returns a canonical permutation of the distances in a field dperm2all that we use there, and I have not figured out whether something like that would be needed here as well

demo circular coordinates in giotto-ph.ipynb.zip
.

@ulupo
Copy link
Collaborator

ulupo commented Aug 26, 2022

Thanks a lot @michiexile! Do you think there is a bug in the cocycles implementation in this PR, or in the necessary post-processing to produce the graphics?

@michiexile
Copy link

Thanks a lot @michiexile! Do you think there is a bug in the cocycles implementation in this PR, or in the necessary post-processing to produce the graphics?

I honestly don't know, and I haven't had time to dig into it. The post-processing, we've been using relatively successfully with ripser and almost identical code. That's why I was talking about dperm2all above - ripser does some permutations to speed things up, and "reports back" by returning a permuted distance matrix. If you too are doing similar permutations, they would need to be reported back for my script to do the right thing.

@ulupo
Copy link
Collaborator

ulupo commented Aug 26, 2022

I see. Since giotto-ph does not include greedy permutations, I'd assume things should work just as they would in ripser.py when greedy permutations are not used there. Would this be a simple test for you to perform?

should have been signed instead of unsigned

Signed-off-by: julian <[email protected]>
@MonkeyBreaker
Copy link
Collaborator Author

Hi @michiexile,

Thank you so much for helping us with this. I looked at your notebook (thank you for sharing it with us ;)), at I compared the results with the cocycles obtained with ripser.py.
It seems I messed up the representation (signed/unsigned) in the back-end, sorry for that.
I "fixed" this, and I am now able to obtain the same cocycles, the rainbow looks really nice !

Regarding:

That's why I was talking about dperm2all above - ripser does some permutations to speed things up, and "reports back" by returning a permuted distance matrix. If you too are doing similar permutations, they would need to be reported back for my script to do the right thing.

If I am not mistaken, the permutations in ripser.py only happens if n_perm parameter is set. Otherwise dperm2all will contain the distance matrix used during the computation. Which in our case we decide to let the user compute it separately if needed (@ulupo maybe we should add it as output ?).

Again thank you so much to help us, please let me know if I can do anything here ;)

@michiexile
Copy link

It seems I messed up the representation (signed/unsigned) in the back-end, sorry for that.
I "fixed" this, and I am now able to obtain the same cocycles, the rainbow looks really nice !

Great! Thank you!!

If I am not mistaken, the permutations in ripser.py only happens if n_perm parameter is set. Otherwise dperm2all will contain the distance matrix used during the computation. Which in our case we decide to let the user compute it separately if needed (@ulupo maybe we should add it as output ?).

It's not clear to me what you might want to do with an explicit cocycle that does not involve the explicit boundary matrix in one form or another - in my example code, we use the distance matrix to create the relevant boundary matrix. Maybe add them (as many of "distance matrix" and "boundary matrix" as you can provide) as optional outputs?

@ulupo
Copy link
Collaborator

ulupo commented Aug 30, 2022

@MonkeyBreaker thanks for chasing the issue and fixing it!

@michiexile returning the distance matrix (as ripser.py does) would be very simple to implement. One could either do it automatically when return_cocycles=True, or have another optional boolean parameter (something like return_distance_matrix).

@MonkeyBreaker
Copy link
Collaborator Author

Hey everyone,

Sorry I was unexpectedly busy this last week ...
But now I will try to push things forward with your help 😄 .

About returning the computed distance Matrix, we did it before and we remove it because of memory and performance implications if I remember right.
I don't have a really strong opinion on the matter, I don't think that adding an additional parameter for returning the computed the distance matrix is a good idea, make the interface more heavy. But, I am not sure neither if we should always return when cocycles are enable the distance matrix (prevent memory duplication if the user provided already a distance matrix to ripser interface).

What are your thoughts on this (@ulupo and @michiexile), if you think it makes sense to always return it when computing cocycles makes sense, then sure, lets do it.

Also, another point, @michiexile would you have an idea of an example of cocycles related to an essential pair (non-finite) ? Is that even possible ?

Thank you again for the help you provide us with :) !

@michiexile
Copy link

michiexile commented Sep 13, 2022 via email

@MonkeyBreaker
Copy link
Collaborator Author

MonkeyBreaker commented Sep 13, 2022

Shouldn't it be really easy to detect whether the user is providing a distance matrix or not, and only produce the distance matrix when the user asks for cocycles and has not provided a distance matrix themselves?

Sorry if I made it sound like it could be difficult. My concern here is if user provides an already computed distance matrix and we return it, that we don't duplicate memory but just forward (move) the memory. Per example, with a dataset of 50k points, it took on my machine like 10GB, if we are not careful, it could take 20GB when returning the distance matrix, because Python copies to return it.

I will try this week to prepare something here and check if we don't have this issue of memory duplication.

I have a hard time thinking of a situation where I would want the cocycles explicitly, but have no need for the distance matrix (or more specifically, no need for the VR-complex coboundary map). If you're willing to build the coboundary map for me, I can probably survive without a distance matrix pretty often.

I think the coboundary map could be interesting, but would require some heavy changes in c++ part. I would for the moment avoid this but if this might be useful for additional use cases, then, it would be part of another PR.

Ummmm... I would pretty much always work with essential pairs. Because non-essential pairs no longer produce cocycles at the radius you are looking at (because they have both a birth and a death time smaller than the "current" time)

Right, sorry I was confused, I don't have a background in mathematics, without @ulupo I am lost sometimes.

I hope this helps describe the usecase I'm interested in?

Definitely, I need to prepare a reproducible test where I always use the same input dataset and compute twice cocycles with a different threshold on each run. But, as Umberto already discussed with me, we would need to have access to the coboundary matrix, that, currently is only present in the c++ part...

Anyway, don't worry to much on this, I will still think about it.
On the meantime, I will try this week to provide the computed distance matrix as an output, do you have a suggestion of output parameter name ? Like ripser.py is calling it dperm2all, but in our case we don't yet perform any permutations.

Also, concerning your notebook, do you need additional features from our side ?

Best,
Julián

@ulupo
Copy link
Collaborator

ulupo commented Sep 13, 2022

I'm on vacation now and can only reply very briefly but this is just to say that returning the coboundary matrix may not be a terrible idea if it saves the user the extra effort. @MonkeyBreaker I already coded this for the unit tests that should be included in this PR and I included it in a notebook I shared with you a while back.

@michiexile is there something the distance matrix would provide beyond the ability to build the coboundary matrix yourself?

Sorry if I'm missing something due to the rush.

@MonkeyBreaker
Copy link
Collaborator Author

Hey,
Didn't have much time this week, and it doesn't seem I will have much next week.

I'm on vacation now and can only reply very briefly but this is just to say that returning the coboundary matrix may not be a terrible idea if it saves the user the extra effort. @MonkeyBreaker I already coded this for the unit tests that should be included in this PR and I included it in a notebook I shared with you a while back.

I think that returning the coboundary matrix would be a good idea, but not for this PR (or if we compute it in Python). The reason is that one of the tricks user in ripser to accelerate computation and also consume the smallest amount of memory is to not store the coboundary matrix. But to generate it on the fly, this means that introducing this feature could require a lot of heavy changes in the backend. But maybe I am misunderstanding something here, lets have a chat once you are back from holidays ;) .

Otherwise, returning the distance matrix if computed, only for when returning cocycles, I am fine with this. I think we have 3 cases:

  • No threshold applied from the user: We return the dense computed distance matrix.
  • A threshold was applied by the user: We return a coo_matrix((data, (row, col)) ? (or maybe a csr_matrix ?)
  • When weights enable, do we return before weight computation or the one after weight computation ?

@ulupo
Copy link
Collaborator

ulupo commented Sep 20, 2022

@MonkeyBreaker, thanks for the comments! What I meant to say initially is that we could compute the coboundary matrix in pure Python, as I already partly did that work for those unit tests I mentioned.

But now I think that, while those unit tests should be there, it would be a bit ambitious to construct the coboundary matrix in full generality. I only covered small homology dimensions in those tests, and only mod 2 coefficients. I could do it in full generality by including further dependencies (such as PHAT or GUDHI), but we probably shouldn't go there just yet.

So I also now agree that we should return the distance matrix only if the user wants cocycles. We should always return the distance matrix just before it is passed to the C++ backend. So, in the case with weights, the distance matrix after reweighting.

@michiexile
Copy link

michiexile commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants