Merge pull request #97 from JuliaAI/dev
For a 0.1.6 Release
EssamWisam authored Mar 15, 2024
2 parents 7ff8c3a + a5debb5 commit 122ef71
Showing 11 changed files with 52 additions and 26 deletions.
17 changes: 11 additions & 6 deletions README.md
@@ -34,29 +34,34 @@ The package implements the following resampling algorithms
- Tomek Links Undersampling
- Balanced Bagging Classifier (@MLJBalancing.jl)

To see various examples where such methods help improve classification performance, check the [tutorials sections](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.
To see various examples where such methods help improve classification performance, check the [tutorials section](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.

Interested in contributing with more? Check [this](https://juliaai.github.io/Imbalance.jl/dev/contributing/).

## 🚀 Quick Start

We will illustrate using the package to oversample with `SMOTE`; however, all other implemented oversampling methods follow the same pattern.

Let's start by generating some dummy imbalanced data:

### 🔵 Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Set dataset properties then generate imbalanced data
class_probs = [0.5, 0.2, 0.3] # probability of each class
num_rows, num_continuous_feats = 100, 5
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### 🔵 Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Apply SMOTE to oversample the classes
Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### 🤖 MLJ Interface
All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/): instead of calling the method directly, one instantiates a model for it (optionally passing the keyword parameters found in the functional interface), wraps the model in a `machine`, and then calls `transform` on the machine and the data.
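For instance, a minimal sketch of this pattern for `SMOTE` could look as follows (assuming the model is registered with MLJ as `SMOTE` under `pkg=Imbalance`, and reusing the keyword arguments from the functional example above):

```julia
using MLJ

# load the MLJ model type for the oversampler (registry name assumed)
SMOTE = @load SMOTE pkg=Imbalance verbosity=0

# instantiate the model with the keyword parameters of the functional interface
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=>0.9, 2=>0.8), rng=42)

# static transform: the machine needs no training data; data is passed at transform time
mach = machine(oversampler)
Xover, yover = transform(mach, X, y)
```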
@@ -81,7 +86,7 @@ All implemented oversampling methods are considered static transforms and hence,
If [MLJBalancing](https://github.com/JuliaAI/MLJBalancing.jl) is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).

```julia
using MLJBalancing
using MLJ, MLJBalancing

# grab two resamplers and a classifier
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
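# ---- illustrative continuation (a sketch, not taken from this commit) ----
# Assumes Imbalance registers `SMOTE` and `RandomUndersampler` with MLJ and that
# MLJBalancing provides `BalancedModel(model=..., balancer1=..., balancer2=...)`.
SMOTE = @load SMOTE pkg=Imbalance verbosity=0
RandomUndersampler = @load RandomUndersampler pkg=Imbalance verbosity=0

oversampler = SMOTE(k=5, ratios=1.0, rng=42)
undersampler = RandomUndersampler(ratios=1.0, rng=42)

# wrap the classifier with the resamplers; resampling happens before training
# and is bypassed at prediction time
balanced_model = BalancedModel(model=LogisticClassifier(),
                               balancer1=oversampler,
                               balancer2=undersampler)

mach = machine(balanced_model, X, y) |> fit!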
2 changes: 1 addition & 1 deletion docs/src/examples/effect_of_s/effect_of_s.ipynb
@@ -365,7 +365,7 @@
"id": "0edce49b",
"metadata": {},
"source": [
"Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment."
"Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats). "
]
},
{
2 changes: 1 addition & 1 deletion docs/src/examples/effect_of_s/effect_of_s.md
@@ -160,7 +160,7 @@ models(matching(Xover, yover))
(name = XGBoostClassifier, package_name = XGBoost, ... )


Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment.
Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats).


```julia
@@ -453,7 +453,7 @@
"id": "4e909c8b",
"metadata": {},
"source": [
"Let's go for a logistic classifier form MLJLinearModels"
"Let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl)"
]
},
{
@@ -216,16 +216,12 @@ ms = models(matching(Xover, yover))
(name = DeterministicConstantClassifier, package_name = MLJModels, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )


Let's go for a logistic classifier form MLJLinearModels

We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categorical features. Hence, let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl).

```julia
import Pkg; Pkg.add("BetaML")
```
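As a hedged sketch (assuming BetaML's tree model is registered with MLJ as `DecisionTreeClassifier`), loading it would follow the usual `@load` pattern:

```julia
using MLJ

# registry name assumed; verbosity=0 silences the loading message
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=BetaML verbosity=0
model = DecisionTreeClassifier()
```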

Let's go for a decision tree from BetaML. We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categotical features.

### Before Oversampling


17 changes: 11 additions & 6 deletions docs/src/index.md
@@ -36,21 +36,26 @@ Interested in contributing with more? Check [this](https://juliaai.github.io/Imb

We will illustrate using the package to oversample with `SMOTE`; however, all other implemented oversampling methods follow the same pattern.

### Standard API
All methods by default support a pure functional interface.
Let's start by generating some dummy imbalanced data:

```julia
using Imbalance

# Set dataset properties then generate imbalanced data
class_probs = [0.5, 0.2, 0.3] # probability of each class
num_rows, num_continuous_feats = 100, 5
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Apply SMOTE to oversample the classes
Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### MLJ Interface
All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/): instead of calling the method directly, one instantiates a model for it (optionally passing the keyword parameters found in the functional interface), wraps the model in a `machine`, and then calls `transform` on the machine and the data.
@@ -75,7 +80,7 @@ All implemented oversampling methods are considered static transforms and hence,
If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).

```julia
using MLJBalancing
using MLJ, MLJBalancing

# grab two resamplers and a classifier
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
4 changes: 2 additions & 2 deletions src/generic_resample.jl
@@ -23,7 +23,7 @@ $(COMMON_DOCS["RATIOS"])
new labels generated by the oversampling method for each class.
"""
function generic_oversample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector,
oversample_per_class,
args...;
@@ -83,7 +83,7 @@ This function is a generic implementation of undersampling methods that apply so
- `y_under`: An abstract vector of class labels corresponding to `X_under`
"""
function generic_undersample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector,
undersample_per_class,
args...;
@@ -10,7 +10,7 @@ generate n new observations for that class using random oversampling
- `Xnew`: A matrix where each column is a new observation generated by ROSE
"""
function random_oversample_per_class(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
n::Integer;
rng::AbstractRNG = default_rng(),
)
@@ -129,12 +129,13 @@ A full basic example along with an animation can be found [here](https://githubt
section which also explains running code on Google Colab.
"""
function random_oversample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector;
ratios = 1.0,
rng::Union{AbstractRNG,Integer} = default_rng(),
try_preserve_type::Bool = true,
)
rng = rng_handler(rng)
Xover, yover = generic_oversample(X, y, random_oversample_per_class; ratios, rng,)
return Xover, yover
@@ -10,7 +10,7 @@ randomly remove n observations for that class using random undersampling
- `Xnew`: A matrix containing the undersampled observations
"""
function random_undersample_per_class(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
n::Integer;
rng::AbstractRNG = default_rng(),
)
@@ -130,7 +130,7 @@ A full basic example along with an animation can be found [here](https://githubt
section which also explains running code on Google Colab.
"""
function random_undersample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector;
ratios = 1.0,
rng::Union{AbstractRNG, Integer} = default_rng(),
10 changes: 10 additions & 0 deletions test/oversampling/random_oversample.jl
@@ -21,6 +21,16 @@ using Imbalance: random_oversample
)
# Check that the number of uniques is the same
@test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))

y = ["A", "A", "B", "A", "B"]
X = [1 1.1 2.1;
1 1.2 2.2;
2 1.3 2.3;
1 1.4 missing;
2 1.5 2.5; ]
Xover, yover = random_oversample(X, y)
@test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))

end


9 changes: 9 additions & 0 deletions test/undersampling/random_undersample.jl
@@ -21,6 +21,15 @@ using Imbalance: random_undersample
)
# Check that X_under is a subset of X
@test issubset(Set(eachrow(X_under)), Set(eachrow(X)))

y = ["A", "A", "B", "A", "B"]
X = [1 1.1 2.1;
1 1.2 2.2;
2 1.3 2.3;
1 1.4 missing;
2 1.5 2.5; ]
X_under, y_under = random_undersample(X, y)
@test issubset(Set(eachrow(X_under)), Set(eachrow(X)))
end

# test that the materializer works for dataframes
