diff --git a/README.md b/README.md
index 170fa168..8363b0a3 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ The package implements the following resampling algorithms
 - Tomek Links Undersampling
 - Balanced Bagging Classifier (@MLJBalancing.jl)
 
-To see various examples where such methods help improve classification performance, check the [tutorials sections](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.
+To see various examples where such methods help improve classification performance, check the [tutorials section](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.
 
 Interested in contributing with more? Check [this](https://juliaai.github.io/Imbalance.jl/dev/contributing/).
 
@@ -42,21 +42,26 @@ Interested in contributing with more? Check [this](https://juliaai.github.io/Imb
 We will illustrate using the package to oversample with`SMOTE`; however, all other implemented oversampling methods follow the same pattern.
 
+Let's start by generating some dummy imbalanced data:
 
-### 🔵 Standard API
-All methods by default support a pure functional interface.
 ```julia
 using Imbalance
 
 # Set dataset properties then generate imbalanced data
 class_probs = [0.5, 0.2, 0.3] # probability of each class
 num_rows, num_continuous_feats = 100, 5
-X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
+X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
+```
+In the following code blocks, it will be assumed that `X` and `y` are readily available.
+
+### 🔵 Standard API
+All methods by default support a pure functional interface.
+```julia
+using Imbalance
 
 # Apply SMOTE to oversample the classes
 Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
 ```
-In following code blocks, it will be assumed that `X` and `y` are readily available.
 
 ### 🤖 MLJ Interface
 All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/) where instead of directly calling the method, one instantiates a model for the method while optionally passing the keyword parameters found in the functional interface then wraps the model in a `machine` and follows by calling `transform` on the machine and data.
@@ -81,7 +86,7 @@ All implemented oversampling methods are considered static transforms and hence,
 If [MLJBalancing](https://github.com/JuliaAI/MLJBalancing.jl) is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).
 
 ```julia
-using MLJBalancing
+using MLJ, MLJBalancing
 
 # grab two resamplers and a classifier
 LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
diff --git a/docs/src/examples/effect_of_s/effect_of_s.ipynb b/docs/src/examples/effect_of_s/effect_of_s.ipynb
index 62f2dc62..e9404f2c 100644
--- a/docs/src/examples/effect_of_s/effect_of_s.ipynb
+++ b/docs/src/examples/effect_of_s/effect_of_s.ipynb
@@ -365,7 +365,7 @@
    "id": "0edce49b",
    "metadata": {},
    "source": [
-    "Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment."
+    "Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats). "
    ]
   },
   {
diff --git a/docs/src/examples/effect_of_s/effect_of_s.md b/docs/src/examples/effect_of_s/effect_of_s.md
index 2df7fac3..25183afe 100644
--- a/docs/src/examples/effect_of_s/effect_of_s.md
+++ b/docs/src/examples/effect_of_s/effect_of_s.md
@@ -160,7 +160,7 @@ models(matching(Xover, yover))
     (name = XGBoostClassifier, package_name = XGBoost, ... )
 
-Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment.
+Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats).
 
 ```julia
diff --git a/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb b/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb
index 09bae9c0..277b4663 100644
--- a/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb
+++ b/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb
@@ -453,7 +453,7 @@
    "id": "4e909c8b",
    "metadata": {},
    "source": [
-    "Let's go for a logistic classifier form MLJLinearModels"
+    "Let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl)"
    ]
   },
   {
diff --git a/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.md b/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.md
index 25f19e60..7d023291 100644
--- a/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.md
+++ b/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.md
@@ -216,16 +216,12 @@ ms = models(matching(Xover, yover))
     (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
     (name = RandomForestClassifier, package_name = BetaML, ... )
 
-
-Let's go for a logistic classifier form MLJLinearModels
-
+We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categorical features. Hence, let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl).
 
 ```julia
 import Pkg; Pkg.add("BetaML")
 ```
 
-Let's go for a decision tree from BetaML. We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categotical features.
-
 ### Before Oversampling
 
diff --git a/docs/src/index.md b/docs/src/index.md
index ef375ebf..2a05b44e 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -36,21 +36,26 @@ Interested in contributing with more? Check [this](https://juliaai.github.io/Imb
 We will illustrate using the package to oversample with`SMOTE`; however, all other implemented oversampling methods follow the same pattern.
 
-### Standard API
-All methods by default support a pure functional interface.
+Let's start by generating some dummy imbalanced data:
+
 ```julia
 using Imbalance
 
 # Set dataset properties then generate imbalanced data
 class_probs = [0.5, 0.2, 0.3] # probability of each class
 num_rows, num_continuous_feats = 100, 5
-X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
+X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
+```
+In the following code blocks, it will be assumed that `X` and `y` are readily available.
+
+### Standard API
+All methods by default support a pure functional interface.
+```julia
+using Imbalance
 
 # Apply SMOTE to oversample the classes
 Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
-
 ```
-In following code blocks, it will be assumed that `X` and `y` are readily available.
 
 ### MLJ Interface
 All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/) where instead of directly calling the method, one instantiates a model for the method while optionally passing the keyword parameters found in the functional interface then wraps the model in a `machine` and follows by calling `transform` on the machine and data.
@@ -75,7 +80,7 @@ All implemented oversampling methods are considered static transforms and hence,
 If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).
 
 ```julia
-using MLJBalancing
+using MLJ, MLJBalancing
 
 # grab two resamplers and a classifier
 LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
diff --git a/src/generic_resample.jl b/src/generic_resample.jl
index e411791b..923cd257 100644
--- a/src/generic_resample.jl
+++ b/src/generic_resample.jl
@@ -23,7 +23,7 @@ $(COMMON_DOCS["RATIOS"])
 new labels generated by the oversampling method for each class.
 """
 function generic_oversample(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     y::AbstractVector,
     oversample_per_class,
     args...;
@@ -83,7 +83,7 @@ This function is a generic implementation of undersampling methods that apply so
 - `y_under`: An abstract vector of class labels corresponding to `X_under`
 """
 function generic_undersample(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     y::AbstractVector,
     undersample_per_class,
     args...;
diff --git a/src/oversampling_methods/random_oversample/random_oversample.jl b/src/oversampling_methods/random_oversample/random_oversample.jl
index eefb4917..9d5f7345 100644
--- a/src/oversampling_methods/random_oversample/random_oversample.jl
+++ b/src/oversampling_methods/random_oversample/random_oversample.jl
@@ -10,7 +10,7 @@ generate n new observations for that class using random oversampling
 - `Xnew`: A matrix where each column is a new observation generated by ROSE
 """
 function random_oversample_per_class(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     n::Integer;
     rng::AbstractRNG = default_rng(),
 )
@@ -129,12 +129,12 @@ A full basic example along with an animation can be found [here](https://githubt
 section which also explains running code on Google Colab.
 """
 function random_oversample(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     y::AbstractVector;
     ratios = 1.0,
     rng::Union{AbstractRNG,Integer} = default_rng(),
     try_preserve_type::Bool = true,
 )
     rng = rng_handler(rng)
     Xover, yover = generic_oversample(X, y, random_oversample_per_class; ratios, rng,)
     return Xover, yover
diff --git a/src/undersampling_methods/random_undersample/random_undersample.jl b/src/undersampling_methods/random_undersample/random_undersample.jl
index 4bfd564d..439c23cb 100644
--- a/src/undersampling_methods/random_undersample/random_undersample.jl
+++ b/src/undersampling_methods/random_undersample/random_undersample.jl
@@ -10,7 +10,7 @@ randomly remove n observations for that class using random undersampling
 - `Xnew`: A matrix containing the undersampled observations
 """
 function random_undersample_per_class(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     n::Integer;
     rng::AbstractRNG = default_rng(),
 )
@@ -130,7 +130,7 @@ A full basic example along with an animation can be found [here](https://githubt
 section which also explains running code on Google Colab.
 """
 function random_undersample(
-    X::AbstractMatrix{<:Real},
+    X::AbstractMatrix{<:Union{Real, Missing}},
     y::AbstractVector;
     ratios = 1.0,
     rng::Union{AbstractRNG, Integer} = default_rng(),
diff --git a/test/oversampling/random_oversample.jl b/test/oversampling/random_oversample.jl
index c69e3eb1..b0d7de6b 100644
--- a/test/oversampling/random_oversample.jl
+++ b/test/oversampling/random_oversample.jl
@@ -21,6 +21,16 @@ using Imbalance: random_oversample
     )
     # Check that the number of uniques is the same
     @test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))
+
+    y = ["A", "A", "B", "A", "B"]
+    X = [1 1.1 2.1;
+         1 1.2 2.2;
+         2 1.3 2.3;
+         1 1.4 missing;
+         2 1.5 2.5; ]
+    Xover, yover = random_oversample(X, y)
+    @test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))
+
 end
diff --git a/test/undersampling/random_undersample.jl b/test/undersampling/random_undersample.jl
index e992a313..5f2488c0 100644
--- a/test/undersampling/random_undersample.jl
+++ b/test/undersampling/random_undersample.jl
@@ -21,6 +21,15 @@ using Imbalance: random_undersample
     )
     # Check that X_under is a subset of X
     @test issubset(Set(eachrow(X_under)), Set(eachrow(X)))
+
+    y = ["A", "A", "B", "A", "B"]
+    X = [1 1.1 2.1;
+         1 1.2 2.2;
+         2 1.3 2.3;
+         1 1.4 missing;
+         2 1.5 2.5; ]
+    X_under, y_under = random_undersample(X, y)
+    @test issubset(Set(eachrow(X_under)), Set(eachrow(X)))
 end
 
 # test that the materializer works for dataframes