
Port sample/feature selection from equisolve to metatensor-learn #560

Open · wants to merge 4 commits into master
Conversation

@jwa7 (Member) commented Mar 26, 2024

Closes #558.

Ports over the sample/feature selection tools (i.e. FPS and CUR) for TensorMaps, from equisolve to metatensor-learn.

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

Reviewer checklist

  • CHANGELOG updated with public API or any other important changes?

📚 Documentation preview 📚: https://metatensor--560.org.readthedocs.build/en/560/

@jwa7 jwa7 marked this pull request as ready for review March 26, 2024 19:26
github-actions bot commented Mar 26, 2024

Here is a pre-built version of the code in this pull request: wheels.zip, you can install it locally by unzipping wheels.zip and using pip to install the file matching your system

@PicoCentauri (Contributor)

What is the status here? The code looks good, as it is just ported, but there seems to be an import issue:

E   ImportError: attempted relative import with no known parent package
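For context, this error typically means a module using a relative import (e.g. `from ._utils import ...`) was executed or imported without its parent package. A minimal reproduction, with hypothetical module names (not the actual metatensor-learn layout):

```python
# Demonstrate why "attempted relative import with no known parent package"
# occurs: a file with a relative import must be imported *through* its
# package, not run as a top-level script. Package/module names are made up.
import os
import subprocess
import sys
import tempfile

tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mypkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "_utils.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(pkg, "selection.py"), "w") as f:
    f.write("from ._utils import VALUE\n")  # relative import

# Running the file directly: Python sees no parent package, so it fails.
proc = subprocess.run(
    [sys.executable, os.path.join(pkg, "selection.py")],
    capture_output=True,
    text=True,
)
assert "attempted relative import" in proc.stderr

# Importing it as part of the package works fine.
sys.path.insert(0, tmp)
from mypkg.selection import VALUE

print(VALUE)  # -> 42
```

The usual fix in a test suite is to make sure the tests import the installed package (`from mypkg import ...`) rather than adding the package's internal directory to the path.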

@jwa7 jwa7 requested a review from PicoCentauri April 3, 2024 12:38
@PicoCentauri (Contributor)

The code looks good to me, basically because I wrote it back in the good old equisolve times.

@Luthaf do you want to also take a look?

@jwa7 jwa7 requested a review from Luthaf April 3, 2024 12:55
@Luthaf (Contributor) left a comment

The functions need to be added to the docs! It would also be nice to have a tutorial on sample/feature selection.

@@ -0,0 +1,172 @@
import functools

can we avoid creating a bunch of utils just for this feature? Ideally this would use existing TensorMap fixtures/files from the repo.

@@ -136,6 +137,7 @@ deps =
{[testenv]packaging_deps}
{[testenv]testing_deps}
torch=={env:METATENSOR_TESTS_TORCH_VERSION:2.2.*}
skmatter

should we have one test environment without optional dependencies, to make sure we can run without them?
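Such an environment could look something like this sketch in tox.ini (the env name and command are made up; the deps variables follow the file quoted above):

```ini
# Hypothetical extra environment: run the test suite *without* the optional
# dependencies (skmatter, torch) to catch any hard imports of them.
[testenv:core-no-optional]
deps =
    {[testenv]packaging_deps}
    {[testenv]testing_deps}
commands =
    pytest {posargs}
```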

self._select_distance = None

@property
def selector_class(self) -> Type[skmatter._selection.GreedySelector]:

why do we need to re-export all these properties?

@PicoCentauri (Contributor) commented Apr 24, 2024

We don't have to. We could have a very slim version that only exposes fit, transform, fit_transform, and inverse_transform.

I added them in the past to be API-compatible with the numpy versions in skmatter. But especially selector_class seems a bit superfluous now that I read this again.
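A sketch of what this "very slim" delegating wrapper could look like (all names hypothetical; the dummy selector stands in for a skmatter-style greedy selector, it simply keeps the first n columns):

```python
# Slim wrapper idea: expose only fit / transform / fit_transform and
# delegate to the underlying selector, instead of re-exporting every
# property one by one.
import numpy as np


class _DummySelector:
    """Stand-in for a skmatter-style selector: keeps the first n columns."""

    def __init__(self, n_to_select):
        self.n_to_select = n_to_select

    def fit(self, X):
        self.selected_idx_ = np.arange(self.n_to_select)
        return self

    def transform(self, X):
        return X[:, self.selected_idx_]


class SlimSelector:
    """Minimal public surface; everything else stays on the inner object."""

    def __init__(self, selector):
        self._selector = selector

    def fit(self, X):
        self._selector.fit(X)
        return self

    def transform(self, X):
        return self._selector.transform(X)

    def fit_transform(self, X):
        return self.fit(X).transform(X)


X = np.random.rand(10, 5)
X_red = SlimSelector(_DummySelector(n_to_select=3)).fit_transform(X)
print(X_red.shape)  # (10, 3)
```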

@property
def support(self) -> TensorMap:
"""
TensorMap containing the support.

If we keep this property, it should have more information on what the support is and what metadata the TensorMap contains.

Comment on lines +72 to +74
For each block, the metadata of the relevant axis (i.e. samples or properties,
depending on whether sample or feature selection is being performed) is sorted
and returned according to the Hausdorff distance, in descending order.

This is not very clear. If I do sample selection, do I get a tensor with "hausdorff distance" as the property, no components, and the samples sorted?

What does the output look like for property selection? I would go for the same thing, except using the input tensor's properties as the samples of get_select_distance.


Basically you just swap samples and properties. Without looking into the code again, I assume you get one sample "hausdorff distance", no components, and the properties sorted.
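A plain-numpy illustration of the layout swap being described (the metadata names and shapes here are assumptions about the intent, not the actual implementation): sample selection stores one distance column with samples sorted in descending order of Hausdorff distance, and property selection is the transpose.

```python
# Sketch of the two output layouts for get_select_distance.
import numpy as np

distances = np.array([3.0, 1.0, 2.0])  # one distance per selected item
order = np.argsort(-distances)         # descending, as in the docstring

# sample selection: shape (n_selected_samples, 1),
# single property "hausdorff_distance", samples sorted by distance
sample_values = distances[order].reshape(-1, 1)

# property selection: same numbers with samples and properties swapped,
# i.e. one sample "hausdorff_distance" and the properties sorted
property_values = sample_values.T

print(sample_values.shape, property_values.shape)  # (3, 1) (1, 3)
```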

@property
def get_select_distance(self) -> TensorMap:
"""
Returns a TensorMap containing the Hausdorff distances.

Isn't the Hausdorff distance only used for FPS? Why is this function defined at the base class level instead of the FPS class?


It is. I assume it is the same in skmatter, but we can move this to only the FPS classes, or create a new base class that only the FPS selectors inherit, to avoid duplicating it in two classes.

n_to_select: Union[int, dict],
**selector_arguments,
) -> None:
self._selector_class = selector_class

Since we are dispatching based on this class, I would assert that it is one of the expected classes.
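The suggested check could be sketched like this (the selector class names are hypothetical stand-ins for the skmatter classes actually being dispatched on):

```python
# Validate selector_class up front instead of failing later during dispatch.
class FPSSelector: ...
class CURSelector: ...

_ALLOWED = (FPSSelector, CURSelector)


def check_selector_class(selector_class):
    """Raise TypeError unless selector_class is one of the expected classes."""
    if not (isinstance(selector_class, type)
            and issubclass(selector_class, _ALLOWED)):
        raise TypeError(
            f"selector_class must be one of {_ALLOWED}, got {selector_class!r}"
        )
    return selector_class


print(check_selector_class(FPSSelector).__name__)  # FPSSelector
```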


def transform(self, X: TensorMap) -> TensorMap:
"""
Reduce X to the selected features.

Suggested change
Reduce X to the selected features.
Reduce X to the selected samples/features.


class FPS(GreedySelector):
"""
Transformer that performs Greedy Feature Selection using Farthest Point Sampling.

this is quite unclear to a user ...

Suggested change
Transformer that performs Greedy Feature Selection using Farthest Point Sampling.
Perform feature selection using Farthest Point Sampling (FPS).

I would also add a link to an FPS paper, or a short explanation of what this is doing. The same goes for all the other selector classes!
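For reference, the core algorithm is short; a minimal numpy sketch of Farthest Point Sampling (this is an illustration, not the metatensor-learn implementation):

```python
# Farthest Point Sampling: greedily pick the point whose distance to the
# closest already-selected point is largest (max-min Euclidean distance).
import numpy as np


def fps(points, n_to_select, first=0):
    selected = [first]
    # distance of every point to the closest selected point so far
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(selected) < n_to_select:
        nxt = int(np.argmax(min_dist))  # farthest from the current selection
        selected.append(nxt)
        min_dist = np.minimum(
            min_dist, np.linalg.norm(points - points[nxt], axis=1)
        )
    return selected


points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 5.0]])
picked = fps(points, n_to_select=3)
print(picked)  # [0, 2, 3]: the near-duplicate point [0.1, 0.0] is skipped
```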

Comment on lines +19 to +21
If `n_to_select` is a dict, it must have keys that are tuples corresponding to the
key values of each block. In this case, the values of the `n_to_select` dict can be
int that specify different number of features to select for each block.

Do we really want to use a dict here? Or should we do the same trick as the other classes (i.e. Labels + List[T])?


And the Labels are the keys of the blocks? And why a List[Tensors] and not List[int]?

@Luthaf (Contributor) commented Apr 24, 2024

I meant List[int]; the T above was used as a template parameter! (This is basically LabelsDict[int].)
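Either way, the dispatch being discussed boils down to normalizing n_to_select into a per-block mapping; a sketch with a hypothetical helper name and example key tuples:

```python
# Normalize n_to_select: a single int applies to every block, while a dict
# keyed by block-key tuples gives a per-block number of features to select.
def resolve_n_to_select(n_to_select, block_keys):
    if isinstance(n_to_select, int):
        return {key: n_to_select for key in block_keys}
    if isinstance(n_to_select, dict):
        missing = [k for k in block_keys if k not in n_to_select]
        if missing:
            raise KeyError(f"no n_to_select entry for blocks: {missing}")
        return dict(n_to_select)
    raise TypeError("n_to_select must be int or dict")


keys = [(0, 1), (1, 1)]  # e.g. hypothetical (center_type, neighbor_type) keys
print(resolve_n_to_select(4, keys))                    # same count everywhere
print(resolve_n_to_select({(0, 1): 2, (1, 1): 5}, keys))  # per-block counts
```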


Successfully merging this pull request may close these issues.

Port equisolve's sample/feature selection to metatensor-learn
3 participants