
Commit 9658c50

KalelR and Datseris authored
Improvement/fix for optimal_radius_dbscan (WIP) (#254)
* Add changes improving optimal_radius_dbscan; should be no more errors now, and improved the estimation of attractors, but still needs further tests and improvements * remove ChaosTools. from code * Fix comments * Removed bug warning and added min-max rescaling feature to unsupervised clustering method * added knee/elbow method for estiamting optimal radius in dbscan; added clustering/utils file with the method, the silhouette method, and some other utils for dbscan. Directory should also include the source code from Clustering.jl, but dbscan errors when I do that. I'll try to fix later, and use Clustering.jl directly for now * implemented support for knee method; moved utilities and optimal radius to clustering/utils * add utils.jl * Fix tests for unsupervised clustering. Two problems were occuring, and were related to the labeling of the attractors: (i) algorithm might return labels [1,2] when correct is [-1, 1] (this is because it identifies the Henon attractor at infinity as an attractor); (ii) algorithm might identify some outlier points (eg some of the points in the FP attractor in Lorenz84), in which case it puts them in -1. Then the labels are [-1, 1,2,3] and not [1,2,3]. In both cases, the values were already within the tolerated error, so I just ignored the labels and tested the values. Also had to increase to , as it improved the clustering for Lorenz84. * Use suggested values for duffing * compare number of attractors, remove comparisons of keys themselves, replace featurizer's estimationi of period by the minimum of A[:,1] * remove commented dependencies that should eventually be included in src/basins/clustering * Increment minor version and update changelog * quick fix changelog Co-authored-by: Datseris <[email protected]>
1 parent 17cf231 commit 9658c50

File tree

6 files changed: +188 −130 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# 2.9
+* Improved the `AttractorsViaFeaturizing` algorithm by improving the method for finding the optimal radius used in the clustering. This consisted of (i) maximizing the average silhouette value, instead of the minimum (slight improvement), (ii) min-max rescaling the features for the clustering (big improvement), and (iii) adding an alternative method, called the elbow method, that is faster but worse at clustering.
+* Changed `attractor_mapping_tests.jl` to deal better with the Featurizing method.
+
 # 2.8
 * Brand new `AttractorMapper` infrastructure. It is a generic framework for mapping initial conditions to attractors and hence calculating basins of attraction and related quantities. Existing originally disparate functionality has been brought together under this framework.
 * The old `basins_of_attraction` function has been completely deprecated in favor of using the version `basins_of_attraction(mapper::AttractorMapper, grid)`, which utilizes the new `AttractorMapper` interface and is more intuitive and generalizable.
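The min-max rescaling mentioned in the changelog normalizes each feature dimension into `[0,1]`, so that no single large-magnitude feature dominates the Euclidean distances that DBSCAN works with. The package itself is Julia; the following is only an illustrative Python sketch of the idea, assuming features are stored as rows of a NumPy array:

```python
import numpy as np

def rescale_minmax(features):
    """Rescale each feature dimension (column) into [0, 1].

    Dimensions with zero range are mapped to all zeros, mirroring the
    guard against division by zero in the commit's `rescale` function.
    """
    features = np.asarray(features, dtype=float)
    mins = features.min(axis=0)
    ranges = features.max(axis=0) - mins
    out = np.zeros_like(features)
    nonzero = ranges > 0
    out[:, nonzero] = (features[:, nonzero] - mins[nonzero]) / ranges[nonzero]
    return out

# Feature dims on very different scales become comparable after rescaling:
feats = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
rescaled = rescale_minmax(feats)  # columns become [0, 0.5, 1] and [0, 1, 0.5]
```

With rescaling, a radius ε in feature space means the same thing along every dimension, which is why the changelog reports a "big improvement" for the Euclidean metric.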

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "ChaosTools"
 uuid = "608a59af-f2a3-5ad4-90b4-758bdf3122a7"
 repo = "https://github.com/JuliaDynamics/ChaosTools.jl.git"
-version = "2.8.2"
+version = "2.9"
 
 [deps]
 Clustering = "aaaa29a8-35af-508c-8bc3-b662a17a0fe5"

src/ChaosTools.jl

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ include("basins/basins_utilities.jl")
 include("basins/fractality_of_basins.jl")
 include("basins/tipping.jl")
 include("basins/sampler.jl")
+include("basins/clustering/utils.jl")
 
 include("dimensions/linear_regions.jl")
 include("dimensions/generalized_dim.jl")

src/basins/attractor_mapping_featurizing.jl

Lines changed: 82 additions & 117 deletions
@@ -20,6 +20,8 @@ struct AttractorsViaFeaturizing{DS<:GeneralizedDynamicalSystem, T, F, A, K, M} <
     clust_method::String
     clustering_threshold::Float64
     min_neighbors::Int
+    rescale_features::Bool
+    optimal_radius_method::String
 end
 DynamicalSystemsBase.get_rule_for_print(m::AttractorsViaFeaturizing) =
     get_rule_for_print(m.ds)
@@ -38,86 +40,94 @@
 """
     AttractorsViaFeaturizing(ds::DynamicalSystem, featurizer::Function; kwargs...) → mapper
 
-Initialize a `mapper` that maps initial conditions
-to attractors using the featurizing and clustering method of [^Stender2021].
-See [`AttractorMapper`](@ref) for how to use the `mapper`.
+Initialize a `mapper` that maps initial conditions to attractors using the featurizing and
+clustering method of [^Stender2021]. See [`AttractorMapper`](@ref) for how to use the
+`mapper`.
 
-`featurizer` is a function that takes as an input an integrated trajectory `A::Dataset`
-and the corresponding time vector `t` and returns a `Vector{<:Real}` of features
-describing the trajectory.
+`featurizer` is a function that takes as an input an integrated trajectory `A::Dataset` and
+the corresponding time vector `t` and returns a `Vector{<:Real}` of features describing the
+trajectory.
 
 ## Keyword arguments
 ### Integration
 * `T=100, Ttr=100, Δt=1, diffeq=NamedTuple()`: Propagated to [`trajectory`](@ref).
 
 ### Feature extraction and classification
-* `attractors_ic = nothing` Enables supervised version, see below.
-  If given, must be a `Dataset` of initial conditions each leading to a different attractor.
-* `min_neighbors = 10`: (unsupervised method only) minimum number of neighbors
-  (i.e. of similar features) each feature needs to have in order to be considered in a
-  cluster (fewer than this, it is labeled as an outlier, `-1`).
+* `attractors_ic = nothing`: Enables the supervised version, see below. If given, must be a
+  `Dataset` of initial conditions, each leading to a different attractor.
+* `min_neighbors = 10`: (unsupervised method only) minimum number of neighbors (i.e. of
+  similar features) each feature needs to have in order to be considered in a cluster
+  (fewer than this, it is labeled as an outlier, `-1`).
 * `clust_method_norm=Euclidean()`: metric to be used in the clustering.
-* `clustering_threshold = 0.0`: Maximum allowed distance between a feature and the
-  cluster center for it to be considered inside the cluster.
-  Only used when `clust_method = "kNN_thresholded"`.
-* `clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN"`:
-  (supervised method only) which clusterization method to
-  apply. If `"kNN"`, the first-neighbor clustering is used. If `"kNN_thresholded"`, a
-  subsequent step is taken, which considers as unclassified (label `-1`) the features
-  whose distance to the nearest template is above the `clustering_threshold`.
+* `clustering_threshold = 0.0`: Maximum allowed distance between a feature and the cluster
+  center for it to be considered inside the cluster. Only used when
+  `clust_method = "kNN_thresholded"`.
+* `clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN"`: (supervised method
+  only) which clustering method to apply. If `"kNN"`, the first-neighbor clustering is
+  used. If `"kNN_thresholded"`, a subsequent step is taken, which considers as unclassified
+  (label `-1`) the features whose distance to the nearest template is above the
+  `clustering_threshold`.
+* `rescale_features = true`: (unsupervised method only) if true, rescale each dimension of
+  the extracted features separately into the range `[0,1]`.
+* `optimal_radius_method = "silhouettes"`: (unsupervised method only) the method used to
+  determine the optimal radius for clustering features. The `"silhouettes"` method chooses
+  the radius that maximizes the average silhouette value of the clusters, and is an
+  iterative optimization procedure that may take some time to execute. The `"elbow"` method
+  chooses the radius according to the elbow (knee, highest-derivative) method (see
+  [`optimal_radius_dbscan_elbow`](@ref) for details), and is quicker, though it possibly
+  leads to worse clustering.
 
 ## Description
-The trajectory `X` of each initial condition is transformed into a vector of features.
-Each feature is a number useful in _characterizing the attractor_ the initial condition
-ends up at, and distinguishing it from other attrators. Example features are the mean or
-standard deviation of one of the of the timeseries of the trajectory,
-the entropy of the first two dimensions, the fractal dimension of `X`,
-or anything else you may fancy.
-The vectors of features are then used to identify to which attractor
-each trajectory belongs (i.e. in which basin of attractor each initial condition is in).
-The method thus relies on the user having at least some basic idea about what attractors
-to expect in order to pick the right features, in contrast to [`AttractorsViaRecurrences`](@ref).
-
-The algorithm of[^Stender2021] that we use has two versions to do this.
-If the attractors are not known a-priori the **unsupervised versions** should be used.
-Here, the vectors of features of each initial condition are mapped to an attractor by
-analysing how the features are clustered in the feature space. Using the DBSCAN algorithm,
-we identify these clusters of features, and consider each cluster to represent an
-attractor. Features whose attractor is not identified are labeled as `-1`.
+The trajectory `X` of each initial condition is transformed into a vector of features. Each
+feature is a number useful in _characterizing the attractor_ the initial condition ends up
+at, and distinguishing it from other attractors. Example features are the mean or standard
+deviation of one of the timeseries of the trajectory, the entropy of the first two
+dimensions, the fractal dimension of `X`, or anything else you may fancy. The vectors of
+features are then used to identify to which attractor each trajectory belongs (i.e. in
+which basin of attraction each initial condition is). The method thus relies on the user
+having at least some basic idea about what attractors to expect in order to pick the right
+features, in contrast to [`AttractorsViaRecurrences`](@ref).
+
+The algorithm of [^Stender2021] that we use has two versions to do this. If the attractors
+are not known a priori, the **unsupervised version** should be used. Here, the vectors of
+features of each initial condition are mapped to an attractor by analysing how the features
+are clustered in the feature space. Using the DBSCAN algorithm, we identify these clusters
+of features, and consider each cluster to represent an attractor. Features whose attractor
+is not identified are labeled as `-1`. If the features span different scales of magnitude,
+rescaling them into the same `[0,1]` interval can bring significant improvements in the
+clustering when the `Euclidean` distance metric is used.
 
 In the **supervised version**, the attractors are known to the user, who provides one
-initial condition for each attractor using the `attractors_ic` keyword.
-The algorithm then evolves these initial conditions, extracts their features, and uses them
-as templates representing the attrators. Each trajectory is considered to belong to the
-nearest template (based on the distance in feature space).
-Notice that the functionality of this version is similar to [`AttractorsViaProximity`](@ref).
-Generally speaking, the [`AttractorsViaProximity`](@ref) is superior. However, if the
-dynamical system has extremely high-dimensionality, there may be reasons to use the
-supervised method of this featurizing algorithm instead.
+initial condition for each attractor using the `attractors_ic` keyword. The algorithm then
+evolves these initial conditions, extracts their features, and uses them as templates
+representing the attractors. Each trajectory is considered to belong to the nearest
+template (based on the distance in feature space). Notice that the functionality of this
+version is similar to [`AttractorsViaProximity`](@ref). Generally speaking,
+[`AttractorsViaProximity`](@ref) is superior. However, if the dynamical system has
+extremely high dimensionality, there may be reasons to use the supervised method of this
+featurizing algorithm instead.
 
 ## Parallelization note
-The trajectories in this method are integrated in parallel using `Threads`.
-To enable this, simply start Julia with the number of threads you want to use.
+The trajectories in this method are integrated in parallel using `Threads`. To enable this,
+simply start Julia with the number of threads you want to use.
 
-[^Stender2021]:
-    Stender & Hoffmann, [bSTAB: an open-source software for computing the basin
+[^Stender2021]: Stender & Hoffmann, [bSTAB: an open-source software for computing the basin
     stability of multi-stable dynamical systems](https://doi.org/10.1007/s11071-021-06786-5)
 """
 function AttractorsViaFeaturizing(ds::GeneralizedDynamicalSystem, featurizer::Function;
     attractors_ic::Union{AbstractDataset, Nothing}=nothing, T=100, Ttr=100, Δt=1,
     clust_method_norm=Euclidean(),
     clustering_threshold = 0.0, min_neighbors = 10, diffeq = NamedTuple(),
-    clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN",
+    clust_method = clustering_threshold > 0 ? "kNN_thresholded" : "kNN",
+    rescale_features=true, optimal_radius_method="silhouettes",
 )
-    if isnothing(attractors_ic)
-        @warn "Unsupervised clustering algorithm is currently bugged and may not identify all clusters."
-    end
     if ds isa ContinuousDynamicalSystem
         T, Ttr, Δt = float.((T, Ttr, Δt))
     end
     return AttractorsViaFeaturizing(
         ds, Ttr, Δt, T, featurizer, attractors_ic, diffeq,
-        clust_method_norm, clust_method, Float64(clustering_threshold), min_neighbors
+        clust_method_norm, clust_method, Float64(clustering_threshold), min_neighbors,
+        rescale_features, optimal_radius_method
     )
 end

@@ -140,7 +150,7 @@ end
 function extract_features(mapper::AttractorsViaFeaturizing, ics::Union{AbstractDataset, Function};
     show_progress = true, N = 1000)
 
-    N = (typeof(ics) <: Function) ? N : size(ics, 1) #number of actual ICs
+    N = (typeof(ics) <: Function) ? N : size(ics, 1) # number of actual ICs
 
     feature_array = Vector{Vector{Float64}}(undef, N)
     if show_progress
@@ -175,7 +185,8 @@ function classify_features(features, mapper::AttractorsViaFeaturizing)
     if !isnothing(mapper.attractors_ic)
         classify_features_distances(features, mapper)
     else
-        classify_features_clustering(features, mapper.min_neighbors, mapper.clust_method_norm)
+        classify_features_clustering(features, mapper.min_neighbors, mapper.clust_method_norm,
+            mapper.rescale_features, mapper.optimal_radius_method)
     end
 end

@@ -198,13 +209,25 @@ function classify_features_distances(features, mapper)
     return class_labels, class_errors
 end
 
+"""
+Does "min-max" rescaling of vector `vec`: rescales it such that its values span `[0,1]`.
+"""
+function rescale(vec::Vector{T}) where T
+    vec .-= minimum(vec)
+    max = maximum(vec)
+    if max == 0 return zeros(T, length(vec)) end
+    vec ./= maximum(vec)
+end
+
 # Unsupervised method: clustering in feature space
-function classify_features_clustering(features, min_neighbors, metric)
-    ϵ_optimal = optimal_radius_dbscan(features, min_neighbors, metric)
+function classify_features_clustering(features, min_neighbors, metric, rescale_features,
+    optimal_radius_method)
+    if rescale_features features = mapslices(rescale, features, dims=2) end
+    ϵ_optimal = optimal_radius_dbscan(features, min_neighbors, metric, optimal_radius_method)
     # Now recalculate the final clustering with the optimal ϵ
-    clusters = Clustering.dbscan(features, ϵ_optimal; min_neighbors)
+    clusters = dbscan(features, ϵ_optimal; min_neighbors)
     clusters, sizes = sort_clusters_calc_size(clusters)
-    class_labels = cluster_props(clusters, features; include_boundary=false)
+    class_labels = cluster_assignment(clusters, features; include_boundary=false)
     # number of real clusters (size above minimum points);
     # this is also the number of "templates"
     k = length(sizes[sizes .> min_neighbors])
@@ -220,61 +243,3 @@ function classify_features_clustering(features, min_neighbors, metric)
     return class_labels, class_errors
 end
 
-#####################################################################################
-# Utilities
-#####################################################################################
-"""
-Util function for `classify_features`. It returns the size of all the DBSCAN clusters and the
-assignment vector, in which the i-th component is the cluster index of the i-th feature
-"""
-function cluster_props(clusters, data; include_boundary=true)
-    assign = zeros(Int, size(data)[2])
-    for (idx, cluster) in enumerate(clusters)
-        assign[cluster.core_indices] .= idx
-        if cluster.boundary_indices != []
-            if include_boundary
-                assign[cluster.boundary_indices] .= idx
-            else
-                assign[cluster.boundary_indices] .= -1
-            end
-        end
-    end
-    return assign
-end
-
-"""
-Util function for `classify_features`. Calculates the clusters' (DbscanCluster) size
-and sorts them in decreasing order according to the size.
-"""
-function sort_clusters_calc_size(clusters)
-    sizes = [cluster.size for cluster in clusters]
-    idxsort = sortperm(sizes; rev = true)
-    return clusters[idxsort], sizes[idxsort]
-end
-
-"""
-Find the optimal radius ε of a point neighborhood for use in DBSCAN, in the unsupervised
-`classify_features`. It does so by finding the `ε` which maximizes the minimum silhouette
-of the cluster.
-"""
-function optimal_radius_dbscan(features, min_neighbors, metric)
-    feat_ranges = maximum(features, dims=2)[:,1] .- minimum(features, dims=2)[:,1];
-    ϵ_grid = range(minimum(feat_ranges)/200, minimum(feat_ranges), length=200)
-    s_grid = zeros(size(ϵ_grid)) # min silhouette values (which we want to maximize)
-
-    # vary ϵ to find the best one (which will maximize the minimum silhouette)
-    for i=1:length(ϵ_grid)
-        clusters = dbscan(features, ϵ_grid[i]; min_neighbors)
-        dists = pairwise(metric, features)
-        class_labels = cluster_props(clusters, features)
-        if length(clusters) ≠ 1 # silhouette undefined if only one cluster
-            sils = silhouettes(class_labels, dists) # values == 0 are due to boundary points
-            s_grid[i] = minimum(sils[sils .!= 0.0]) # minimum silhouette value of core points
-        else
-            s_grid[i] = -2; # this effectively ignores the single-cluster solution
-        end
-    end
-
-    max, idx = findmax(s_grid)
-    ϵ_optimal = ϵ_grid[idx]
-end

0 commit comments
