Port sample/feature selection from equisolve to metatensor-learn #560

Open: wants to merge 4 commits into base: master
248 changes: 248 additions & 0 deletions python/metatensor-learn/metatensor/learn/selection/_selection.py
@@ -0,0 +1,248 @@
from typing import Type, Union

import numpy as np
import skmatter._selection

import metatensor

from .._backend import Labels, TensorBlock, TensorMap


class GreedySelector:
"""
Wraps :py:class:`skmatter._selection.GreedySelector` for a TensorMap.

The class creates a selector for each block. The selection is done based on the
values of each :py:class:`TensorBlock`. Gradients are not considered in the
selection.
"""

def __init__(
self,
selector_class: Type[skmatter._selection.GreedySelector],
selection_type: str,
n_to_select: Union[int, dict],
**selector_arguments,
) -> None:
self._selector_class = selector_class
Review comment (Contributor): Since we are dispatching based on this class, I would assert that it is one of the expected classes.

self._selection_type = selection_type
self._n_to_select = n_to_select
self._selector_arguments = selector_arguments

self._selector_arguments["selection_type"] = self._selection_type
self._support = None
self._select_distance = None

@property
def selector_class(self) -> Type[skmatter._selection.GreedySelector]:
Review comment (Contributor): Why do we need to re-export all these properties?

Reply (PicoCentauri, Apr 24, 2024): We don't have to. We could have a very slim version that only exposes fit, transform, fit_transform, and inverse_transform. I added them in the past to be API-compatible with the numpy versions in skmatter. But especially selector_class seems a bit superfluous now that I read this again.

"""
The class to perform the selection. Usually one of 'FPS' or 'CUR'.
"""
return self._selector_class

@property
def selection_type(self) -> str:
"""
Whether to choose a subset of columns ('feature') or rows ('sample').
"""
return self._selection_type

@property
def selector_arguments(self) -> dict:
"""
Arguments passed to the ``selector_class``.
"""
return self._selector_arguments

@property
def support(self) -> TensorMap:
"""
TensorMap containing the support.
Review comment (Contributor): If we keep this property, it should have more information on what the support is, and what metadata the TensorMap contains.

"""
if self._support is None:
raise ValueError("No selections. Call fit method first.")

return self._support

@property
def get_select_distance(self) -> TensorMap:
"""
Returns a TensorMap containing the Hausdorff distances.
Review comment (Contributor): Isn't the Hausdorff distance only used for FPS? Why is this function defined at the base class level instead of in the FPS class?

Reply (Contributor): It is. I assume it is the same in skmatter, but we could move this to only the FPS classes, or create a new base class that only the FPS selectors inherit, to avoid duplicating it into two classes.


For each block, the metadata of the relevant axis (i.e. samples or properties,
depending on whether sample or feature selection is being performed) is sorted
and returned according to the Hausdorff distance, in descending order.
Review comment on lines +72 to +74 (Contributor): This is not very clear. If I do sample selection, I get a tensor with "hausdorff distance" as the property, no components, and the samples sorted? How does the output look for property selection? I would go for the same thing, except using the input tensor properties as the samples of get_select_distance.

Reply (Contributor): Basically you just swap samples and properties. Without looking into the code again, I assume you get one sample "hausdorff distance", no components, and the properties sorted.

"""
if self._selector_class == skmatter._selection._CUR:
raise ValueError("Hausdorff distances not available for CUR in skmatter.")
if self._select_distance is None:
raise ValueError("No Hausdorff distances. Call fit method first.")

return self._select_distance

def fit(self, X: TensorMap, warm_start: bool = False) -> None:
"""
Learn the features to select.

:param X: the input training vectors to fit.
:param warm_start: bool, whether the fit should continue after having already
run, after increasing `n_to_select`. Assumes it is called with the same X.
"""
        # Check that we have only 0 or 1 component axes
if len(X.component_names) == 0:
has_components = False
elif len(X.component_names) == 1:
has_components = True
else:
assert len(X.component_names) > 1
raise ValueError("Can only handle TensorMaps with a single component axis.")

support_blocks = []
if self._selector_class == skmatter._selection._FPS:
hausdorff_blocks = []
for key, block in X.items():
# Parse the n_to_select argument
max_n = (
len(block.properties)
if self._selection_type == "feature"
else len(block.samples)
)
if isinstance(self._n_to_select, int):
if (
self._n_to_select == -1
): # set to the number of samples/features for this block
tmp_n_to_select = max_n
else:
tmp_n_to_select = self._n_to_select

elif isinstance(self._n_to_select, dict):
tmp_n_to_select = self._n_to_select[tuple(key.values)]
else:
raise ValueError("n_to_select must be an int or a dict.")

if not (0 < tmp_n_to_select <= max_n):
raise ValueError(
                    f"n_to_select ({tmp_n_to_select}) must be > 0 and <= the number of "
f"{self._selection_type} for the given block ({max_n})."
)

selector = self.selector_class(
n_to_select=tmp_n_to_select, **self.selector_arguments
)

# If the block has components, reshape to a 2D array such that the
# components expand along the dimension *not* being selected.
block_vals = block.values
if has_components:
n_components = len(block.components[0])
if self._selection_type == "feature":
# Move components into samples
block_vals = block_vals.reshape(
(block_vals.shape[0] * n_components, block_vals.shape[2])
)
else:
assert self._selection_type == "sample"
# Move components into features
block_vals = block.values.reshape(
(block_vals.shape[0], block_vals.shape[2] * n_components)
)

# Fit on the block values
selector.fit(block_vals, warm_start=warm_start)

# Build the support TensorMap. In this case we want the mask to be a
# list of bools, such that the original order of the metadata is
# preserved.
supp_mask = selector.get_support()
if self._selection_type == "feature":
supp_samples = Labels.single()
supp_properties = Labels(
names=block.properties.names,
values=block.properties.values[supp_mask],
)
elif self._selection_type == "sample":
supp_samples = Labels(
names=block.samples.names, values=block.samples.values[supp_mask]
)
supp_properties = Labels.single()

supp_vals = np.zeros(
[len(supp_samples), len(supp_properties)], dtype=np.int32
)
support_blocks.append(
TensorBlock(
values=supp_vals,
samples=supp_samples,
components=[],
properties=supp_properties,
)
)

if self._selector_class == skmatter._selection._FPS:
# Build the Hausdorff TensorMap, only for FPS. In this case we want the
# mask to be a list of int such that the samples/properties are
# reordered according to the Hausdorff distance.
haus_mask = selector.get_support(indices=True, ordered=True)
if self._selection_type == "feature":
haus_samples = Labels.single()
haus_properties = Labels(
names=block.properties.names,
values=block.properties.values[haus_mask],
)
elif self._selection_type == "sample":
haus_samples = Labels(
names=block.samples.names,
values=block.samples.values[haus_mask],
)
haus_properties = Labels.single()

haus_vals = selector.hausdorff_at_select_[haus_mask].reshape(
len(haus_samples), len(haus_properties)
)
hausdorff_blocks.append(
TensorBlock(
values=haus_vals,
samples=haus_samples,
components=[],
properties=haus_properties,
)
)

self._support = TensorMap(X.keys, support_blocks)
if self._selector_class == skmatter._selection._FPS:
self._select_distance = TensorMap(X.keys, hausdorff_blocks)

return self
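The component-folding done in ``fit`` above can be sketched with plain numpy. The shapes and variable names here are illustrative only, not metatensor API:

```python
import numpy as np

# Hypothetical block values with shape (n_samples, n_components, n_properties).
n_samples, n_components, n_properties = 4, 3, 5
block_vals = np.arange(60, dtype=float).reshape(n_samples, n_components, n_properties)

# Feature selection: fold components into the sample axis, so the selector
# sees one row per (sample, component) pair and selects among properties.
feature_view = block_vals.reshape(n_samples * n_components, n_properties)

# Sample selection: fold components into the property axis instead, so the
# selector sees one column per (component, property) pair and selects samples.
sample_view = block_vals.reshape(n_samples, n_components * n_properties)

print(feature_view.shape)  # (12, 5)
print(sample_view.shape)   # (4, 15)
```

Either way, the selector only ever sees a 2D array whose non-selected axis absorbs the components.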

def transform(self, X: TensorMap) -> TensorMap:
"""
Reduce X to the selected features.
Review comment (Contributor): Suggested change: "Reduce X to the selected features." → "Reduce X to the selected samples/features."


:param X: the input tensor.
:returns: the selected subset of the input.
"""
blocks = []
for key, block in X.items():
block_support = self.support.block(key)

if self._selection_type == "feature":
new_block = metatensor.slice_block(
block, "properties", block_support.properties
)
elif self._selection_type == "sample":
new_block = metatensor.slice_block(
block, "samples", block_support.samples
)
blocks.append(new_block)

return TensorMap(X.keys, blocks)

def fit_transform(self, X: TensorMap, warm_start: bool = False) -> TensorMap:
"""
Fit to data, then transform it.

:param X: TensorMap of the training vectors.
:param warm_start: bool, whether the fit should continue after having already
run, after increasing `n_to_select`. Assumes it is called with the same X.
"""
return self.fit(X, warm_start=warm_start).transform(X)
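For intuition, the greedy loop that skmatter's FPS performs (and whose distances ``get_select_distance`` exposes) can be sketched in a few lines of plain Python. ``fps_indices`` is a hypothetical helper for illustration; the PR delegates the real work to ``skmatter._selection._FPS``:

```python
from math import dist

def fps_indices(points, n_to_select, initialize=0):
    # Greedy farthest point sampling: repeatedly pick the point farthest
    # from everything selected so far (illustrative sketch only).
    selected = [initialize]
    # Hausdorff distances: distance of each point to the current selection.
    hausdorff = [dist(p, points[initialize]) for p in points]
    while len(selected) < n_to_select:
        farthest = max(range(len(points)), key=hausdorff.__getitem__)
        selected.append(farthest)
        hausdorff = [min(d, dist(p, points[farthest]))
                     for d, p in zip(hausdorff, points)]
    return selected

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 5.0)]
print(fps_indices(points, 3))  # [0, 2, 3]
```

Note how the near-duplicate point at (0.1, 0.0) is picked last: FPS favors points that cover the space, which is what makes the per-selection distances useful diagnostics.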
@@ -0,0 +1,93 @@
"""
Wrappers for the feature selectors of `scikit-matter`_.

.. _`scikit-matter`: https://scikit-matter.readthedocs.io/en/latest/selection.html
"""

from skmatter._selection import _CUR, _FPS

from ._selection import GreedySelector


class FPS(GreedySelector):
"""
Transformer that performs Greedy Feature Selection using Farthest Point Sampling.
Review comment (Contributor): This is quite unclear to a user. Suggested change: "Transformer that performs Greedy Feature Selection using Farthest Point Sampling." → "Perform feature selection using Farthest Point Sampling (FPS)." I would also add a link to some FPS paper or a short explanation of what this is doing. Same thing for all other selector classes!


    If `n_to_select` is an `int`, all blocks will have this many features selected.
    In this case, `n_to_select` must be less than or equal to the smallest number
    of features in any block.

    If `n_to_select` is a dict, its keys must be tuples corresponding to the key
    values of each block, and its values must be ints specifying the number of
    features to select for each block.
Review comment on lines +19 to +21 (Contributor): Do we really want to use a dict here? Or should we do the same trick as the other classes (i.e. Labels + List[T])?

Reply (Contributor): And Labels are the keys of the blocks? And why a List[Tensors] and not List[int]?

Reply (Luthaf, Apr 24, 2024): I meant List[int], the T above was used as a template parameter! (This is basically LabelsDict[int].)


    If `n_to_select` is -1, all features for every block will be selected. This is
    useful, for instance, for plotting Hausdorff distances, which can be accessed
    through the selector's ``hausdorff_at_select_`` attribute after calling the
    ``fit()`` method.

Refer to :py:class:`skmatter.feature_selection.FPS` for full documentation.
"""

def __init__(
self,
initialize=0,
n_to_select=None,
score_threshold=None,
score_threshold_type="absolute",
progress_bar=False,
full=False,
random_state=0,
):
super().__init__(
selector_class=_FPS,
selection_type="feature",
initialize=initialize,
n_to_select=n_to_select,
score_threshold=score_threshold,
score_threshold_type=score_threshold_type,
progress_bar=progress_bar,
full=full,
random_state=random_state,
)


class CUR(GreedySelector):
"""
Transformer that performs Greedy Feature Selection with CUR.

    If `n_to_select` is an `int`, all blocks will have this many features selected.
    In this case, `n_to_select` must be less than or equal to the smallest number
    of features in any block.

    If `n_to_select` is a dict, its keys must be tuples corresponding to the key
    values of each block, and its values must be ints specifying the number of
    features to select for each block.

If `n_to_select` is -1, all features for every block will be selected.

Refer to :py:class:`skmatter.feature_selection.CUR` for full documentation.
"""

def __init__(
self,
recompute_every=1,
k=1,
tolerance=1e-12,
n_to_select=None,
score_threshold=None,
score_threshold_type="absolute",
progress_bar=False,
full=False,
random_state=0,
):
super().__init__(
selector_class=_CUR,
selection_type="feature",
recompute_every=recompute_every,
k=k,
tolerance=tolerance,
n_to_select=n_to_select,
score_threshold=score_threshold,
score_threshold_type=score_threshold_type,
progress_bar=progress_bar,
full=full,
random_state=random_state,
)
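The three accepted forms of `n_to_select` (a fixed int for every block, -1 for everything, or a per-block dict keyed by block key tuples) resolve as in this small sketch. ``resolve_n_to_select`` is a hypothetical helper mirroring the parsing inside ``GreedySelector.fit``; the key tuples and block sizes are made up:

```python
def resolve_n_to_select(n_to_select, key, max_n):
    # Per-block resolution of n_to_select (illustrative, not part of the PR).
    if isinstance(n_to_select, int):
        # -1 means "select everything available in this block".
        return max_n if n_to_select == -1 else n_to_select
    if isinstance(n_to_select, dict):
        # Per-block counts, keyed by the tuple of block key values.
        return n_to_select[key]
    raise ValueError("n_to_select must be an int or a dict.")

print(resolve_n_to_select(10, (1, 0), 64))           # 10 for every block
print(resolve_n_to_select(-1, (1, 0), 64))           # 64: all features
print(resolve_n_to_select({(1, 0): 5}, (1, 0), 64))  # 5 for this block
```

The resolved value is then checked against the block's sample/feature count before the skmatter selector is constructed.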