Port sample/feature selection from equisolve to metatensor-learn
#560
base: master
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,248 @@ | ||||||
from typing import Type, Union | ||||||
|
||||||
import numpy as np | ||||||
import skmatter._selection | ||||||
|
||||||
import metatensor | ||||||
|
||||||
from .._backend import Labels, TensorBlock, TensorMap | ||||||
|
||||||
|
||||||
class GreedySelector: | ||||||
""" | ||||||
Wraps :py:class:`skmatter._selection.GreedySelector` for a TensorMap. | ||||||
|
||||||
The class creates a selector for each block. The selection will be done based on the | ||||||
values of each :py:class:`TensorBlock`. Gradients will not be considered for the | ||||||
selection. | ||||||
""" | ||||||
|
||||||
def __init__( | ||||||
self, | ||||||
selector_class: Type[skmatter._selection.GreedySelector], | ||||||
selection_type: str, | ||||||
n_to_select: Union[int, dict], | ||||||
**selector_arguments, | ||||||
) -> None: | ||||||
self._selector_class = selector_class | ||||||
self._selection_type = selection_type | ||||||
self._n_to_select = n_to_select | ||||||
self._selector_arguments = selector_arguments | ||||||
|
||||||
self._selector_arguments["selection_type"] = self._selection_type | ||||||
self._support = None | ||||||
self._select_distance = None | ||||||
|
||||||
@property | ||||||
def selector_class(self) -> Type[skmatter._selection.GreedySelector]: | ||||||
Review comment: Why do we need to re-export all these properties?
Reply: We don't have to. We could have a very slim version that only exposes the [...] I added them in the past to be API-compatible with the NumPy versions in skmatter. But especially [...] |
||||||
""" | ||||||
The class to perform the selection. Usually one of 'FPS' or 'CUR'. | ||||||
""" | ||||||
return self._selector_class | ||||||
|
||||||
@property | ||||||
def selection_type(self) -> str: | ||||||
""" | ||||||
Whether to choose a subset of columns ('feature') or rows ('sample'). | ||||||
""" | ||||||
return self._selection_type | ||||||
|
||||||
@property | ||||||
def selector_arguments(self) -> dict: | ||||||
""" | ||||||
Arguments passed to the ``selector_class``. | ||||||
""" | ||||||
return self._selector_arguments | ||||||
|
||||||
@property | ||||||
def support(self) -> TensorMap: | ||||||
""" | ||||||
TensorMap containing the support. | ||||||
Review comment: If we keep this property, it should have more information on what the support is, and what metadata the TensorMap contains. |
||||||
""" | ||||||
if self._support is None: | ||||||
raise ValueError("No selections. Call fit method first.") | ||||||
|
||||||
return self._support | ||||||
|
||||||
@property | ||||||
def get_select_distance(self) -> TensorMap: | ||||||
""" | ||||||
Returns a TensorMap containing the Hausdorff distances. | ||||||
Review comment: Isn't the Hausdorff distance only used for FPS? Why is this function defined at the base class level instead of the FPS class?
Reply: It is. I assume it is the same in skmatter, but we can move this to only the FPS classes, or create a new base class that only the FPS selectors inherit from, to avoid duplicating this into two classes. |
||||||
|
||||||
For each block, the metadata of the relevant axis (i.e. samples or properties, | ||||||
depending on whether sample or feature selection is being performed) is sorted | ||||||
and returned according to the Hausdorff distance, in descending order. | ||||||
Comment on lines +72 to +74:
Review comment: This is not very clear. If I do sample selection, I get a tensor with "Hausdorff distance" as property, no components, and the samples sorted? What does the output look like for property selection? I would go for the same thing, except using the input tensor properties as the samples of the [...]
Reply: Basically you just swap samples and properties. Without looking into the code again, I assume you get one sample "Hausdorff distance", no components, and the properties sorted. |
||||||
""" | ||||||
if self._selector_class == skmatter._selection._CUR: | ||||||
raise ValueError("Hausdorff distances not available for CUR in skmatter.") | ||||||
if self._select_distance is None: | ||||||
raise ValueError("No Hausdorff distances. Call fit method first.") | ||||||
|
||||||
return self._select_distance | ||||||
|
||||||
def fit(self, X: TensorMap, warm_start: bool = False) -> "GreedySelector": | ||||||
""" | ||||||
Learn the features to select. | ||||||
|
||||||
:param X: the input training vectors to fit. | ||||||
:param warm_start: bool, whether the fit should continue after having already | ||||||
run, after increasing `n_to_select`. Assumes it is called with the same X. | ||||||
""" | ||||||
# Check that we have only 0 or 1 component axes | ||||||
if len(X.component_names) == 0: | ||||||
has_components = False | ||||||
elif len(X.component_names) == 1: | ||||||
has_components = True | ||||||
else: | ||||||
assert len(X.component_names) > 1 | ||||||
raise ValueError("Can only handle TensorMaps with a single component axis.") | ||||||
|
||||||
support_blocks = [] | ||||||
if self._selector_class == skmatter._selection._FPS: | ||||||
hausdorff_blocks = [] | ||||||
for key, block in X.items(): | ||||||
# Parse the n_to_select argument | ||||||
max_n = ( | ||||||
len(block.properties) | ||||||
if self._selection_type == "feature" | ||||||
else len(block.samples) | ||||||
) | ||||||
if isinstance(self._n_to_select, int): | ||||||
if ( | ||||||
self._n_to_select == -1 | ||||||
): # set to the number of samples/features for this block | ||||||
tmp_n_to_select = max_n | ||||||
else: | ||||||
tmp_n_to_select = self._n_to_select | ||||||
|
||||||
elif isinstance(self._n_to_select, dict): | ||||||
tmp_n_to_select = self._n_to_select[tuple(key.values)] | ||||||
else: | ||||||
raise ValueError("n_to_select must be an int or a dict.") | ||||||
|
||||||
if not (0 < tmp_n_to_select <= max_n): | ||||||
raise ValueError( | ||||||
f"n_to_select ({tmp_n_to_select}) must be > 0 and <= the number of " | ||||||
f"{self._selection_type} for the given block ({max_n})." | ||||||
) | ||||||
|
||||||
selector = self.selector_class( | ||||||
n_to_select=tmp_n_to_select, **self.selector_arguments | ||||||
) | ||||||
|
||||||
# If the block has components, reshape to a 2D array such that the | ||||||
# components expand along the dimension *not* being selected. | ||||||
block_vals = block.values | ||||||
if has_components: | ||||||
n_components = len(block.components[0]) | ||||||
if self._selection_type == "feature": | ||||||
# Move components into samples | ||||||
block_vals = block_vals.reshape( | ||||||
(block_vals.shape[0] * n_components, block_vals.shape[2]) | ||||||
) | ||||||
else: | ||||||
assert self._selection_type == "sample" | ||||||
# Move components into features | ||||||
block_vals = block.values.reshape( | ||||||
(block_vals.shape[0], block_vals.shape[2] * n_components) | ||||||
) | ||||||
|
||||||
# Fit on the block values | ||||||
selector.fit(block_vals, warm_start=warm_start) | ||||||
|
||||||
# Build the support TensorMap. In this case we want the mask to be a | ||||||
# list of bools, such that the original order of the metadata is | ||||||
# preserved. | ||||||
supp_mask = selector.get_support() | ||||||
if self._selection_type == "feature": | ||||||
supp_samples = Labels.single() | ||||||
supp_properties = Labels( | ||||||
names=block.properties.names, | ||||||
values=block.properties.values[supp_mask], | ||||||
) | ||||||
elif self._selection_type == "sample": | ||||||
supp_samples = Labels( | ||||||
names=block.samples.names, values=block.samples.values[supp_mask] | ||||||
) | ||||||
supp_properties = Labels.single() | ||||||
|
||||||
supp_vals = np.zeros( | ||||||
[len(supp_samples), len(supp_properties)], dtype=np.int32 | ||||||
) | ||||||
support_blocks.append( | ||||||
TensorBlock( | ||||||
values=supp_vals, | ||||||
samples=supp_samples, | ||||||
components=[], | ||||||
properties=supp_properties, | ||||||
) | ||||||
) | ||||||
|
||||||
if self._selector_class == skmatter._selection._FPS: | ||||||
# Build the Hausdorff TensorMap, only for FPS. In this case we want the | ||||||
# mask to be a list of int such that the samples/properties are | ||||||
# reordered according to the Hausdorff distance. | ||||||
haus_mask = selector.get_support(indices=True, ordered=True) | ||||||
if self._selection_type == "feature": | ||||||
haus_samples = Labels.single() | ||||||
haus_properties = Labels( | ||||||
names=block.properties.names, | ||||||
values=block.properties.values[haus_mask], | ||||||
) | ||||||
elif self._selection_type == "sample": | ||||||
haus_samples = Labels( | ||||||
names=block.samples.names, | ||||||
values=block.samples.values[haus_mask], | ||||||
) | ||||||
haus_properties = Labels.single() | ||||||
|
||||||
haus_vals = selector.hausdorff_at_select_[haus_mask].reshape( | ||||||
len(haus_samples), len(haus_properties) | ||||||
) | ||||||
hausdorff_blocks.append( | ||||||
TensorBlock( | ||||||
values=haus_vals, | ||||||
samples=haus_samples, | ||||||
components=[], | ||||||
properties=haus_properties, | ||||||
) | ||||||
) | ||||||
|
||||||
self._support = TensorMap(X.keys, support_blocks) | ||||||
if self._selector_class == skmatter._selection._FPS: | ||||||
self._select_distance = TensorMap(X.keys, hausdorff_blocks) | ||||||
|
||||||
return self | ||||||
|
||||||
def transform(self, X: TensorMap) -> TensorMap: | ||||||
""" | ||||||
Reduce X to the selected features. | ||||||
|
||||||
|
||||||
:param X: the input tensor. | ||||||
:returns: the selected subset of the input. | ||||||
""" | ||||||
blocks = [] | ||||||
for key, block in X.items(): | ||||||
block_support = self.support.block(key) | ||||||
|
||||||
if self._selection_type == "feature": | ||||||
new_block = metatensor.slice_block( | ||||||
block, "properties", block_support.properties | ||||||
) | ||||||
elif self._selection_type == "sample": | ||||||
new_block = metatensor.slice_block( | ||||||
block, "samples", block_support.samples | ||||||
) | ||||||
blocks.append(new_block) | ||||||
|
||||||
return TensorMap(X.keys, blocks) | ||||||
|
||||||
def fit_transform(self, X: TensorMap, warm_start: bool = False) -> TensorMap: | ||||||
""" | ||||||
Fit to data, then transform it. | ||||||
|
||||||
:param X: TensorMap of the training vectors. | ||||||
:param warm_start: bool, whether the fit should continue after having already | ||||||
run, after increasing `n_to_select`. Assumes it is called with the same X. | ||||||
""" | ||||||
return self.fit(X, warm_start=warm_start).transform(X) |
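The per-block handling of `n_to_select` inside `fit` can be read in isolation. The following is a minimal sketch of those rules, not code from this PR: `resolve_n_to_select` is a hypothetical helper, and block keys are represented as plain tuples rather than metatensor label entries.

```python
from typing import Dict, Tuple, Union


def resolve_n_to_select(
    n_to_select: Union[int, Dict[Tuple[int, ...], int]],
    key: Tuple[int, ...],
    max_n: int,
) -> int:
    """Return how many samples/features to select for a single block.

    Hypothetical helper mirroring the rules in ``fit``: an int applies to
    every block, -1 means "select everything in this block", and a dict is
    looked up with the tuple of key values of the block.
    """
    if isinstance(n_to_select, int):
        n = max_n if n_to_select == -1 else n_to_select
    elif isinstance(n_to_select, dict):
        n = n_to_select[key]
    else:
        raise ValueError("n_to_select must be an int or a dict.")

    # The request must fit within the size of the block being selected from.
    if not (0 < n <= max_n):
        raise ValueError(f"n_to_select ({n}) must be > 0 and <= {max_n}.")
    return n


# Two blocks keyed by a single integer label, holding 8 and 3 features.
sizes = {(1,): 8, (6,): 3}
request = {(1,): 5, (6,): 2}
per_block = {key: resolve_n_to_select(request, key, n) for key, n in sizes.items()}
everything = {key: resolve_n_to_select(-1, key, n) for key, n in sizes.items()}
print(per_block)
print(everything)
```

Note that a single int larger than the smallest block (for example `4` with the sizes above) fails for that block, which is why the docstrings require the int form to fit every block.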
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,93 @@ | ||||||
""" | ||||||
Wrappers for the feature selectors of `scikit-matter`_. | ||||||
|
||||||
.. _`scikit-matter`: https://scikit-matter.readthedocs.io/en/latest/selection.html | ||||||
""" | ||||||
|
||||||
from skmatter._selection import _CUR, _FPS | ||||||
|
||||||
from ._selection import GreedySelector | ||||||
|
||||||
|
||||||
class FPS(GreedySelector): | ||||||
""" | ||||||
Transformer that performs Greedy Feature Selection using Farthest Point Sampling. | ||||||
Review comment: This is quite unclear to a user ... I would also add a link to some FPS paper or a short explanation of what this is doing. Same thing for all other selector classes! |
||||||
|
||||||
If `n_to_select` is an `int`, all blocks will have this many features selected. In | ||||||
this case, `n_to_select` must be <= the smallest number of features in any block. | ||||||
|
||||||
If `n_to_select` is a dict, it must have keys that are tuples corresponding to the | ||||||
key values of each block. In this case, the values of the `n_to_select` dict can be | ||||||
ints that specify a different number of features to select for each block. | ||||||
Comment on lines +19 to +21:
Review comment: Do we really want to use a dict here? Or should we do the same trick as the other classes (i.e. [...]
Follow-up: and [...]
Follow-up: I meant [...] |
||||||
|
||||||
If `n_to_select` is -1, all features for every block will be selected. This is | ||||||
useful, for instance, for plotting Hausdorff distances, which can be accessed | ||||||
through the ``get_select_distance`` property after calling the ``fit()`` method. | ||||||
|
||||||
Refer to :py:class:`skmatter.feature_selection.FPS` for full documentation. | ||||||
""" | ||||||
|
||||||
def __init__( | ||||||
self, | ||||||
initialize=0, | ||||||
n_to_select=None, | ||||||
score_threshold=None, | ||||||
score_threshold_type="absolute", | ||||||
progress_bar=False, | ||||||
full=False, | ||||||
random_state=0, | ||||||
): | ||||||
super().__init__( | ||||||
selector_class=_FPS, | ||||||
selection_type="feature", | ||||||
initialize=initialize, | ||||||
n_to_select=n_to_select, | ||||||
score_threshold=score_threshold, | ||||||
score_threshold_type=score_threshold_type, | ||||||
progress_bar=progress_bar, | ||||||
full=full, | ||||||
random_state=random_state, | ||||||
) | ||||||
|
||||||
|
||||||
class CUR(GreedySelector): | ||||||
""" | ||||||
Transformer that performs Greedy Feature Selection with CUR. | ||||||
|
||||||
If `n_to_select` is an `int`, all blocks will have this many features selected. In | ||||||
this case, `n_to_select` must be <= the smallest number of features in any block. | ||||||
|
||||||
If `n_to_select` is a dict, it must have keys that are tuples corresponding to the | ||||||
key values of each block. In this case, the values of the `n_to_select` dict can be | ||||||
ints that specify a different number of features to select for each block. | ||||||
|
||||||
If `n_to_select` is -1, all features for every block will be selected. | ||||||
|
||||||
Refer to :py:class:`skmatter.feature_selection.CUR` for full documentation. | ||||||
""" | ||||||
|
||||||
def __init__( | ||||||
self, | ||||||
recompute_every=1, | ||||||
k=1, | ||||||
tolerance=1e-12, | ||||||
n_to_select=None, | ||||||
score_threshold=None, | ||||||
score_threshold_type="absolute", | ||||||
progress_bar=False, | ||||||
full=False, | ||||||
random_state=0, | ||||||
): | ||||||
super().__init__( | ||||||
selector_class=_CUR, | ||||||
selection_type="feature", | ||||||
recompute_every=recompute_every, | ||||||
k=k, | ||||||
tolerance=tolerance, | ||||||
n_to_select=n_to_select, | ||||||
score_threshold=score_threshold, | ||||||
score_threshold_type=score_threshold_type, | ||||||
progress_bar=progress_bar, | ||||||
full=full, | ||||||
random_state=random_state, | ||||||
) |
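For readers unfamiliar with farthest point sampling, the idea behind the wrapped selector can be sketched in a few lines. This is an illustrative pure-Python version under simplifying assumptions (Euclidean distance, points given as lists of floats, the first point chosen by index). `farthest_point_sampling` is a hypothetical name, and this is not the skmatter implementation, whose distance conventions may differ.

```python
import math
from typing import List, Sequence, Tuple


def farthest_point_sampling(
    points: Sequence[Sequence[float]], n_to_select: int, initialize: int = 0
) -> Tuple[List[int], List[float]]:
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    Returns the selected indices in selection order and, for each selection,
    the distance of that point to the previously selected set at the moment
    it was picked (infinity for the initial point).
    """
    if not (0 < n_to_select <= len(points)):
        raise ValueError("n_to_select must be > 0 and <= the number of points")

    selected = [initialize]
    distances_at_select = [math.inf]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[initialize]) for p in points]

    while len(selected) < n_to_select:
        # Only unselected points are candidates for the next pick.
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        distances_at_select.append(min_dist[nxt])
        # Refresh nearest-selected distances to account for the new point.
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]

    return selected, distances_at_select


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 0.0]]
indices, distances = farthest_point_sampling(points, n_to_select=3)
print(indices, distances)
```

The recorded distances never increase from one pick to the next, which is what makes them useful for judging how many points are worth selecting.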
Review comment: Since we are dispatching based on this class, I would assert it is one of the expected classes.