Merge pull request #97 from JuliaAI/dev
For a 0.1.6 Release
EssamWisam authored Mar 15, 2024
2 parents 7ff8c3a + a5debb5 commit 122ef71
Showing 11 changed files with 52 additions and 26 deletions.
17 changes: 11 additions & 6 deletions README.md
@@ -34,29 +34,34 @@ The package implements the following resampling algorithms
- Tomek Links Undersampling
- Balanced Bagging Classifier (@MLJBalancing.jl)

To see various examples where such methods help improve classification performance, check the [tutorials sections](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.
To see various examples where such methods help improve classification performance, check the [tutorials section](https://juliaai.github.io/Imbalance.jl/dev/examples/) of the documentation.

Interested in contributing with more? Check [this](https://juliaai.github.io/Imbalance.jl/dev/contributing/).

## 🚀 Quick Start

We will illustrate using the package to oversample with `SMOTE`; however, all other implemented oversampling methods follow the same pattern.

Let's start by generating some dummy imbalanced data:

### 🔵 Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Set dataset properties then generate imbalanced data
class_probs = [0.5, 0.2, 0.3] # probability of each class
num_rows, num_continuous_feats = 100, 5
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### 🔵 Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Apply SMOTE to oversample the classes
Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### 🤖 MLJ Interface
All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/): instead of calling the method directly, one instantiates a model for it (optionally passing the keyword parameters found in the functional interface), wraps the model in a `machine`, and then calls `transform` on the machine and the data.
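For instance, a minimal sketch of this pattern for `SMOTE` could look as follows (assuming the model is registered with MLJ as `SMOTE` under `pkg=Imbalance`, and reusing the keyword arguments from the functional example above):

```julia
using MLJ

# load the MLJ model type for the oversampler (registry name assumed)
SMOTE = @load SMOTE pkg=Imbalance verbosity=0

# instantiate the model with the keyword parameters of the functional interface
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=>0.9, 2=>0.8), rng=42)

# static transform: the machine needs no training data; data is passed at transform time
mach = machine(oversampler)
Xover, yover = transform(mach, X, y)
```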
@@ -81,7 +86,7 @@ All implemented oversampling methods are considered static transforms and hence,
If [MLJBalancing](https://github.com/JuliaAI/MLJBalancing.jl) is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).

```julia
using MLJBalancing
using MLJ, MLJBalancing

# grab two resamplers and a classifier
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
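# ---- illustrative continuation (a sketch, not taken from this commit) ----
# Assumes Imbalance registers `SMOTE` and `RandomUndersampler` with MLJ and that
# MLJBalancing provides `BalancedModel(model=..., balancer1=..., balancer2=...)`.
SMOTE = @load SMOTE pkg=Imbalance verbosity=0
RandomUndersampler = @load RandomUndersampler pkg=Imbalance verbosity=0

oversampler = SMOTE(k=5, ratios=1.0, rng=42)
undersampler = RandomUndersampler(ratios=1.0, rng=42)

# wrap the classifier with the resamplers; resampling happens before training
# and is bypassed at prediction time
balanced_model = BalancedModel(model=LogisticClassifier(),
                               balancer1=oversampler,
                               balancer2=undersampler)

mach = machine(balanced_model, X, y) |> fit!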
2 changes: 1 addition & 1 deletion docs/src/examples/effect_of_s/effect_of_s.ipynb
@@ -365,7 +365,7 @@
"id": "0edce49b",
"metadata": {},
"source": [
"Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment."
"Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats). "
]
},
{
2 changes: 1 addition & 1 deletion docs/src/examples/effect_of_s/effect_of_s.md
@@ -160,7 +160,7 @@ models(matching(Xover, yover))
(name = XGBoostClassifier, package_name = XGBoost, ... )


Let's go for a [Decision Tree](https://alan-turing-institute.github.io/MLJ.jl/dev/models/DecisionTreeClassifier_DecisionTree/#DecisionTreeClassifier_DecisionTree). This is just like the normal perceptron but it learns the separating hyperplane in a higher dimensional space using the kernel trick so that it corresponds to a nonlinear separating hypersurface in the original space. This isn't necessarily helpful in our case, but just to experiment.
Let's go for a [BayesianLDA](https://alan-turing-institute.github.io/MLJ.jl/dev/models/BayesianLDA_MultivariateStats/#BayesianLDA_MultivariateStats).


```julia
@@ -453,7 +453,7 @@
"id": "4e909c8b",
"metadata": {},
"source": [
"Let's go for a logistic classifier form MLJLinearModels"
"Let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl)"
]
},
{
@@ -216,16 +216,12 @@ ms = models(matching(Xover, yover))
(name = DeterministicConstantClassifier, package_name = MLJModels, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )


Let's go for a logistic classifier form MLJLinearModels

We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categorical features. Hence, let's go for a decision tree classifier from [BetaML](https://github.com/sylvaticus/BetaML.jl).

```julia
import Pkg; Pkg.add("BetaML")
```
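As a hedged sketch (assuming BetaML's tree model is registered with MLJ as `DecisionTreeClassifier`), loading it would follow the usual `@load` pattern:

```julia
using MLJ

# registry name assumed; verbosity=0 silences the loading message
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=BetaML verbosity=0
model = DecisionTreeClassifier()
```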

Let's go for a decision tree from BetaML. We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categotical features.

### Before Oversampling


17 changes: 11 additions & 6 deletions docs/src/index.md
@@ -36,21 +36,26 @@ Interested in contributing with more? Check [this](https://juliaai.github.io/Imb

We will illustrate using the package to oversample with `SMOTE`; however, all other implemented oversampling methods follow the same pattern.

### Standard API
All methods by default support a pure functional interface.
Let's start by generating some dummy imbalanced data:

```julia
using Imbalance

# Set dataset properties then generate imbalanced data
class_probs = [0.5, 0.2, 0.3] # probability of each class
num_rows, num_continuous_feats = 100, 5
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### Standard API
All methods by default support a pure functional interface.
```julia
using Imbalance

# Apply SMOTE to oversample the classes
Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

```
In the following code blocks, it will be assumed that `X` and `y` are readily available.

### MLJ Interface
All methods support the [`MLJ` interface](https://alan-turing-institute.github.io/MLJ.jl/dev/): instead of calling the method directly, one instantiates a model for it (optionally passing the keyword parameters found in the functional interface), wraps the model in a `machine`, and then calls `transform` on the machine and the data.
@@ -75,7 +80,7 @@ All implemented oversampling methods are considered static transforms and hence,
If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).

```julia
using MLJBalancing
using MLJ, MLJBalancing

# grab two resamplers and a classifier
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
4 changes: 2 additions & 2 deletions src/generic_resample.jl
@@ -23,7 +23,7 @@ $(COMMON_DOCS["RATIOS"])
new labels generated by the oversampling method for each class.
"""
function generic_oversample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector,
oversample_per_class,
args...;
@@ -83,7 +83,7 @@ This function is a generic implementation of undersampling methods that apply so
- `y_under`: An abstract vector of class labels corresponding to `X_under`
"""
function generic_undersample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector,
undersample_per_class,
args...;
@@ -10,7 +10,7 @@ generate n new observations for that class using random oversampling
- `Xnew`: A matrix where each column is a new observation generated by ROSE
"""
function random_oversample_per_class(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
n::Integer;
rng::AbstractRNG = default_rng(),
)
@@ -129,12 +129,13 @@ A full basic example along with an animation can be found [here](https://githubt
section which also explains running code on Google Colab.
"""
function random_oversample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector;
ratios = 1.0,
rng::Union{AbstractRNG,Integer} = default_rng(),
try_preserve_type::Bool = true,
)
rng = rng_handler(rng)
Xover, yover = generic_oversample(X, y, random_oversample_per_class; ratios, rng,)
return Xover, yover
@@ -10,7 +10,7 @@ randomly remove n observations for that class using random undersampling
- `Xnew`: A matrix containing the undersampled observations
"""
function random_undersample_per_class(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
n::Integer;
rng::AbstractRNG = default_rng(),
)
@@ -130,7 +130,7 @@ A full basic example along with an animation can be found [here](https://githubt
section which also explains running code on Google Colab.
"""
function random_undersample(
X::AbstractMatrix{<:Real},
X::AbstractMatrix{<:Union{Real, Missing}},
y::AbstractVector;
ratios = 1.0,
rng::Union{AbstractRNG, Integer} = default_rng(),
10 changes: 10 additions & 0 deletions test/oversampling/random_oversample.jl
@@ -21,6 +21,16 @@ using Imbalance: random_oversample
)
# Check that the number of uniques is the same
@test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))

y = ["A", "A", "B", "A", "B"]
X = [1 1.1 2.1;
1 1.2 2.2;
2 1.3 2.3;
1 1.4 missing;
2 1.5 2.5; ]
Xover, yover = random_oversample(X, y)
@test length(unique(Xover, dims = 1)) == length(unique(X, dims = 1))

end


9 changes: 9 additions & 0 deletions test/undersampling/random_undersample.jl
@@ -21,6 +21,15 @@ using Imbalance: random_undersample
)
# Check that X_under is a subset of X
@test issubset(Set(eachrow(X_under)), Set(eachrow(X)))

y = ["A", "A", "B", "A", "B"]
X = [1 1.1 2.1;
1 1.2 2.2;
2 1.3 2.3;
1 1.4 missing;
2 1.5 2.5; ]
X_under, y_under = random_undersample(X, y)
@test issubset(Set(eachrow(X_under)), Set(eachrow(X)))
end

# test that the materializer works for dataframes
