Commit

Merge pull request #69 from JuliaAI/dev
For a 0.1.2 release
EssamWisam authored Oct 11, 2023
2 parents bff9803 + 4699250 commit 7199d68
Showing 120 changed files with 6,801 additions and 15,524 deletions.
5 changes: 5 additions & 0 deletions .github/codecov.yml
@@ -0,0 +1,5 @@
coverage:
status:
project:
default:
threshold: 0.5%
4 changes: 2 additions & 2 deletions .github/workflows/CI.yml
@@ -1,11 +1,11 @@
name: CI
on: [push, pull_request]
on: [push]
jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
matrix:
julia-version: ['1.8']
julia-version: ['1.6', '1']
julia-arch: [x64]
os: [ubuntu-latest, windows-latest, macOS-latest]
steps:
4 changes: 2 additions & 2 deletions Project.toml
@@ -41,11 +41,11 @@ Tables = "1.10"
TransformsBase = "1.2"
julia = "1.6"


[extras]
Conda = "8f4d0f93-b110-5947-807f-2305c1781a2d"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
IOCapture = "b5f81e59-6552-4d32-b1f0-c071b021bf89"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
@@ -54,4 +54,4 @@ TableTransforms = "0d432bfd-3ee1-4ac1-886a-39f05cc69a3e"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "DataFrames", "MLJBase", "TableTransforms", "StableRNGs", "PyCall", "Pkg", "Conda", "IOCapture"]
test = ["Test", "DataFrames", "MLJBase", "TableTransforms", "StableRNGs", "PyCall", "Pkg", "Conda", "IOCapture", "JLD2"]
5 changes: 3 additions & 2 deletions README.md
@@ -75,6 +75,7 @@ Xover, yover = transform(mach, X, y)
```
All implemented oversampling methods are static transforms and hence no `fit` is required.
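As a concrete illustration, a minimal sketch of the static-transform workflow (the toy-data helper's keyword names and the hyperparameter values here are assumptions, not taken from the snippet above):

```julia
using MLJ, Imbalance

# toy imbalanced data; `generate_imbalanced_data` ships with Imbalance.jl
# (keyword names recalled from its docs and may differ across versions)
X, y = Imbalance.generate_imbalanced_data(100, 2; class_probs=[0.9, 0.1], rng=42)

SMOTE = @load SMOTE pkg=Imbalance
oversampler = SMOTE(k=5, ratios=1.0, rng=42)

# static transform: the machine is constructed without data and never fit
mach = machine(oversampler)
Xover, yover = transform(mach, X, y)
```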

#### Pipelining Models
If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to form a unified model in which resampling is automatically applied to the data before training (and bypassed during prediction).

```julia
@@ -141,9 +142,9 @@ $$\hat{\theta} = \arg\min_{\theta} \left( \frac{1}{N_1} \sum_{i \in C_1} L(f_{\t

Class imbalance occurs when some classes have far fewer examples than others. The terms corresponding to smaller classes then contribute minimally to the sum, which makes it possible for a learning algorithm to find an approximate minimizer of the empirical risk that mostly minimizes only the dominant sums. This yields a hypothesis $f_\theta$ that may differ greatly from the true target $f$ on the minority classes, which may be the most important ones for the application in question.

One obvious possible remedy is to weight the smaller sums so that a learning algorithm more easily avoids approximate solutions that exploit their insignificance which can be seen to be equivalent to repeating examples of the observations in minority classes. This can be achieved by naive random oversampling which is offered by this package along with other more advanced oversampling methods that function by generating synthetic data, which ideally would be analogous to one of the most plausible solutions to the class imbalance problem: collecting more data.
One obvious remedy is to weight the smaller sums so that a learning algorithm cannot as easily settle for approximate solutions that exploit their insignificance; this weighting can be shown to be equivalent to repeating the observations in the minority classes. Such repetition is exactly what naive random oversampling does, and this package offers it along with more advanced methods that generate synthetic data or delete existing data. You can read more about the class imbalance problem and the algorithms implemented in this package in [this](https://medium.com/@essamwissam/class-imbalance-and-oversampling-a-formal-introduction-c77b918e586d) series of articles on Medium.
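Concretely, writing $N_c = |C_c|$ for the size of class $c$, the reweighting remedy replaces the plain empirical risk with a class-weighted one (a sketch in the notation above; the choice $w_c \propto 1/N_c$ is one common convention, not the only one):

$$\hat{R}_w(\theta) = \frac{1}{N} \sum_{c=1}^{K} w_c \sum_{i \in C_c} L(f_{\theta}(x_i), y_i), \qquad w_c \propto \frac{1}{N_c},$$

so that each class contributes comparably to the objective; for integer weights this is precisely what repeating minority observations achieves.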

To our knowledge, there are no existing maintained Julia packages that implement resampling algorithms for multi-class classification problems or that handle both nominal and continuous features. This has served as a primary motivation for the creation of this package.

## 👥 Credits
This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
3 changes: 1 addition & 2 deletions docs/Project.toml
@@ -4,6 +4,7 @@ CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
Imbalance = "c709b415-507b-45b7-9a3d-1767c89fde68"
Impute = "f7bf1975-0170-51b9-8c5f-a992d46b9575"
LIBSVM = "b1bec4e5-fd48-53fe-b0cb-9723c09d164b"
@@ -14,7 +15,6 @@ MLJFlux = "094fc8d1-fd35-5302-93ea-dabda2abf845"
MLJLIBSVMInterface = "61c7150f-6c77-4bb1-949c-13197eac2a52"
MLJNaiveBayesInterface = "33e4bacb-b9e2-458e-9a13-5d9a90b235fa"
MLJScikitLearnInterface = "5ae90465-5518-4432-b9d2-8a1def2f0cab"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
Measures = "442fdcdd-2543-5da2-b0f3-8c86c306513e"
NaiveBayes = "9bbee03b-0db5-5f46-924f-b5c9c21b8c60"
OneRule = "90484964-6d6a-4979-af09-8657dbed84ff"
@@ -32,7 +32,6 @@ DocumenterTools = "0.1"
Imbalance = "0.1"
MLJ = "0.19"
MLJBase = "0.21"
MLUtils = "0.4"
Plots = "1.39"
ScientificTypes = "3.0"
TableTransforms = "1.10"
3 changes: 2 additions & 1 deletion docs/example-gen.jl
@@ -78,7 +78,8 @@ for item in data
<a href="$colab_link"><img id="colab" src="./assets/colab.png"/></a>
<a href="$link">
<img src="$img_src" alt="Image">
<div class="item-title">$title
<div class="item-title">
<b>$title</b>
<p>$description</p>
</div>
</a>
7 changes: 5 additions & 2 deletions docs/make.jl
@@ -27,15 +27,18 @@ makedocs(sitename = "Imbalance.jl",
"Oversampling"=>"algorithms/oversampling_algorithms.md",
"Undersampling"=>"algorithms/undersampling_algorithms.md",
"Combination"=>"algorithms/mlj_balancing.md",
"Implementation Notes"=>"algorithms/implementation_notes.md",
"Extras"=>"algorithms/extra_algorithms.md",

],
"Walkthrough" => Any[
"Tutorial" => Any[
"Introduction"=>"examples/walkthrough.md",
"More Examples"=>"examples.md",
"Google Colab"=>"examples/Colab.md"
],
"Contributing" => "contributing.md",
"About" => "about.md"],
warnonly = true,
warnonly=true
)


8 changes: 8 additions & 0 deletions docs/src/algorithms/implementation_notes.md
@@ -0,0 +1,8 @@

### Generalizing to Multiclass
Papers often propose the resampling algorithm for the case of binary classification only. In many cases, the algorithm merely expects a set of points to resample and does not depend on the existence of a majority class (e.g., it estimates the distribution of the points and then generates new samples from it), so it can be generalized by simply applying the algorithm to each class. In other cases, there is an interaction with the majority class (e.g., a point is borderline in `BorderlineSMOTE1` if most, but not all, of its neighbors are from the majority class). In this case, a one-vs-rest scheme is used as proposed in [1]. For instance, a point is then borderline if most, but not all, of its neighbors are from a different class.
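The per-class scheme can be sketched as follows, where `resample_class` is a hypothetical stand-in for any class-agnostic resampling routine (this is not the package's internal API):

```julia
# Apply a class-agnostic oversampling routine to each class separately.
# `resample_class(Xc, n)` is assumed to return `n` synthetic points
# generated from the class points `Xc` (observations as columns).
function per_class_oversample(X::AbstractMatrix, y::AbstractVector,
                              resample_class; ratios::Dict)
    majority = maximum(count(==(c), y) for c in unique(y))
    Xnew, ynew = X, copy(y)
    for c in unique(y)
        Xc = X[:, y .== c]
        n = round(Int, ratios[c] * majority) - size(Xc, 2)
        n <= 0 && continue              # class already meets its target size
        Xnew = hcat(Xnew, resample_class(Xc, n))
        append!(ynew, fill(c, n))
    end
    return Xnew, ynew
end
```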

### Generalizing to Real Ratios
Papers often propose the resampling algorithm using integer ratios only. For instance, a ratio of $2$ means doubling the amount of data in a class, while a ratio of $2.2$ is either disallowed or rounded. In `Imbalance.jl`, any appropriate real ratio can be used, and the ratio is relative to the size of the majority or minority class depending on whether the algorithm oversamples or undersamples. The generalization works by randomly choosing points instead of looping over each point. That is, if a $2.2$ ratio corresponds to $227$ examples, then $227$ points are chosen randomly with replacement and the resampling logic is applied to each. Given an integer ratio $k$, this is on average equivalent to looping over the points $k$ times.
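In code, the real-ratio generalization amounts to simple arithmetic followed by a random choice with replacement (the class counts below are hypothetical, picked so the arithmetic reproduces the $227$ in the text):

```julia
using Random, StatsBase

# number of synthetic points for a class, given a real ratio
# measured relative to the majority class size
n_to_generate(class_count, majority_count, ratio) =
    round(Int, ratio * majority_count) - class_count

n = n_to_generate(103, 150, 2.2)   # round(2.2 * 150) - 103 = 227

# choose a source point for each synthetic point, with replacement
idx = sample(MersenneTwister(42), 1:103, n; replace = true)
# ... the per-point resampling logic is then applied to each chosen point
```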

[1] López, V., Fernández, A., Moreno-Torres, J.G., & Herrera, F. (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7), 6585-6608.
16 changes: 8 additions & 8 deletions docs/src/algorithms/oversampling_algorithms.md
@@ -2,15 +2,15 @@

The following table lists the supported oversampling algorithms, whether each mechanism repeats or generates data, and the supported data types.

| Oversampling Method | Mechanism | Supported Data Type |
| Oversampling Method | Mechanism | Supported Data Types |
|:----------:|:----------:|:----------:|
| Random Oversampler | Repeat existing data | Continuous and/or nominal |
| Random Walk Oversampler | Generate synthetic data | Continuous and/or nominal |
| ROSE | Generate synthetic data | Continuous |
| SMOTE | Generate synthetic data | Continuous |
| BorderlineSMOTE1 | Generate synthetic data | Continuous |
| SMOTE-N | Generate synthetic data | Nominal |
| SMOTE-NC | Generate synthetic data | Continuous and nominal |
| [Random Oversampler](@ref) | Repeat existing data | Continuous and/or nominal |
| [Random Walk Oversampler](@ref) | Generate synthetic data | Continuous and/or nominal |
| [ROSE](@ref) | Generate synthetic data | Continuous |
| [SMOTE](@ref) | Generate synthetic data | Continuous |
| [Borderline SMOTE1](@ref) | Generate synthetic data | Continuous |
| [SMOTE-N](@ref) | Generate synthetic data | Nominal |
| [SMOTE-NC](@ref) | Generate synthetic data | Continuous and nominal |
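Each method is also exposed through a plain functional interface. A hedged sketch for `SMOTE` (the toy-data helper's keyword names and the hyperparameter values are illustrative assumptions):

```julia
using Imbalance

# toy data: roughly 80% majority, 20% minority
X, y = Imbalance.generate_imbalanced_data(200, 3; class_probs=[0.8, 0.2], rng=42)

# oversample smaller classes up to 80% of the majority class size,
# using 5 nearest neighbors to interpolate synthetic points
Xover, yover = smote(X, y; k=5, ratios=0.8, rng=42)
```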


## Random Oversampler
10 changes: 5 additions & 5 deletions docs/src/algorithms/undersampling_algorithms.md
@@ -2,12 +2,12 @@

The following table lists the supported undersampling algorithms, whether each mechanism deletes or generates new data, and the supported data types.

| Undersampling Method | Mechanism | Supported Data Type |
| Undersampling Method | Mechanism | Supported Data Types |
|:----------:|:----------:|:----------:|
| Random Undersampler | Delete existing data as needed | Continuous and/or nominal |
| Cluster Undersampler | Generate new data or delete existing data | Continuous |
| ENN Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
| Tomek Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
| [Random Undersampler](@ref) | Delete existing data as needed | Continuous and/or nominal |
| [Cluster Undersampler](@ref) | Generate new data or delete existing data | Continuous |
| [Edited Nearest Neighbors Undersampler](@ref) | Delete existing data meeting certain conditions (cleaning) | Continuous |
| [Tomek Links Undersampler](@ref) | Delete existing data meeting certain conditions (cleaning) | Continuous |
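The functional interface mirrors the oversampling one; a hedged sketch for random undersampling (function and keyword names are assumed from the reference links above, and the values are illustrative):

```julia
using Imbalance

# toy data: roughly 80% majority, 20% minority
X, y = Imbalance.generate_imbalanced_data(200, 3; class_probs=[0.8, 0.2], rng=42)

# shrink every larger class down to 1.0 x the minority class size
Xunder, yunder = random_undersample(X, y; ratios=1.0, rng=42)
```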



4 changes: 4 additions & 0 deletions docs/src/assets/light.scss
@@ -99,4 +99,8 @@ code.nohighlight.hljs {
right: 3px;
width: 11%;
display: none;
}

.content pre {
border-radius: 1rem !important;
}
6 changes: 3 additions & 3 deletions docs/src/contributing.md
@@ -37,10 +37,10 @@ Any method resampling method implemented in the `oversampling_methods` or `under
You can skip the third step if the algorithm you are implementing does not operate in a per-class sense.

### 🔥 Hot algorithms to add
- `K-Means SMOTE`: Takes care of where exactly to generate more points using SMOTE by factoring in "within class imbalance"
- `K-Means SMOTE`: Takes care of where exactly to generate more points using `SMOTE` by factoring in "within class imbalance". This may also be easily generalized to algorithms beyond `SMOTE`.
- `CondensedNearestNeighbors`: Undersamples the dataset so as to preserve the decision boundary found by `KNN`
- `BorderlineSMOTE2`: A small modification of the BorderlineSMOTE1 condition
- `RepeatedENNUndersampler`: Simply repeat `ENNUndersampler` multiple times
- `BorderlineSMOTE2`: A small modification of the `BorderlineSMOTE1` condition
- `RepeatedENNUndersampler`: Simply repeats `ENNUndersampler` multiple times

# Adding New Tutorials
- Make a new notebook with the tutorial in the `examples` folder found in `docs/src/examples`
24 changes: 16 additions & 8 deletions docs/src/examples.md
@@ -5,7 +5,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_ratios/effect_of_ratios.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_ratios/effect_of_ratios">
<img src="./assets/iris smote.jpeg" alt="Image">
<div class="item-title">Effect of Ratios Hyperparameter
<div class="item-title">
<b>Effect of Ratios Hyperparameter</b>
<p>In this tutorial we use an SVM and SMOTE and the Iris data to study
how the decision regions change with the amount of oversampling</p>
</div>
@@ -15,7 +16,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_s/effect_of_s.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_s/effect_of_s">
<img src="./assets/iris rose.jpeg" alt="Image">
<div class="item-title">From Random Oversampling to ROSE
<div class="item-title">
<b>From Random Oversampling to ROSE</b>
<p>In this tutorial we study the `s` parameter in ROSE and the effect
of increasing it.</p>
</div>
@@ -25,7 +27,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smote_churn_dataset/smote_churn_dataset.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smote_churn_dataset/smote_churn_dataset">
<img src="./assets/churn smote.jpeg" alt="Image">
<div class="item-title">SMOTE on Customer Churn Data
<div class="item-title">
<b>SMOTE on Customer Churn Data</b>
<p>In this tutorial we apply SMOTE and random forest to predict customer churn based
on continuous attributes.</p>
</div>
@@ -35,7 +38,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smote_mushroom/smoten_mushroom.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smoten_mushroom/smoten_mushroom">
<img src="./assets/mushy.jpeg" alt="Image">
<div class="item-title">SMOTEN on Mushroom Data
<div class="item-title">
<b>SMOTEN on Mushroom Data</b>
<p>In this tutorial we use a purely categorical dataset to predict mushroom odour.</p>
</div>
</a>
@@ -44,7 +48,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smotenc_churn_dataset/smotenc_churn_dataset">
<img src="./assets/churn smoten.jpeg" alt="Image">
<div class="item-title">SMOTENC on Customer Churn Data
<div class="item-title">
<b>SMOTENC on Customer Churn Data</b>
<p>In this tutorial we extend the SMOTE tutorial to include both categorical and continuous
data for churn prediction</p>
</div>
@@ -54,7 +59,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_k_enn/effect_of_k_enn.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_k_enn/effect_of_k_enn">
<img src="./assets/bmi.jpeg" alt="Image">
<div class="item-title">Effect of ENN Hyperparameters
<div class="item-title">
<b>Effect of ENN Hyperparameters</b>
<p>In this tutorial we observe the effects of the hyperparameters found in ENN undersampling with an SVM model</p>
</div>
</a>
@@ -63,7 +69,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/fraud_detection/fraud_detection.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./fraud_detection/fraud_detection">
<img src="./assets/eth.jpeg" alt="Image">
<div class="item-title">SMOTE-Tomek for Ethereum Fraud Detection
<div class="item-title">
<b>SMOTE-Tomek for Ethereum Fraud Detection</b>
<p>In this tutorial we combine SMOTE with TomekUndersampler and a classification model from MLJ for fraud detection</p>
</div>
</a>
@@ -72,7 +79,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/cerebral_ensemble/cerebral_ensemble.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./cerebral_ensemble/cerebral_ensemble">
<img src="./assets/brain.jpeg" alt="Image">
<div class="item-title">BalancedBagging for Cerebral Stroke Prediction
<div class="item-title">
<b>BalancedBagging for Cerebral Stroke Prediction</b>
<p>In this tutorial we use BalancedBagging from MLJBalancing with Decision Tree to predict Cerebral Strokes</p>
</div>
</a>
23 changes: 23 additions & 0 deletions docs/src/examples/Colab.md
@@ -0,0 +1,23 @@
# Google Colab

Tutorials found in the examples section or the API documentation can be run on Google Colab; how to do so should be evident upon launching the notebook. This section describes what happens under the hood.

- The first cell runs the following bash script to install Julia:

```shell
%%capture
%%shell
if ! command -v julia 3>&1 > /dev/null
then
wget -q 'https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.2-linux-x86_64.tar.gz' \
-O /tmp/julia.tar.gz
tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
rm /tmp/julia.tar.gz
fi
julia -e 'using Pkg; pkg"add IJulia; precompile;"'
echo 'Done'
```

- Once that is done, one can change the runtime to `Julia` by choosing `Runtime` from the toolbar, then `Change runtime type`; at that point the install cell can be deleted

Sincere thanks to [Julia-on-Colab](https://github.com/Dsantra92/Julia-on-Colab) for making this possible.
File renamed without changes.
File renamed without changes.
31 changes: 26 additions & 5 deletions docs/src/examples/cerebral_ensemble/cerebral_ensemble.ipynb
@@ -1,5 +1,20 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "c9d1d017",
"metadata": {},
"outputs": [],
"source": [
"# this installs Julia 1.7\n",
"%%capture\n",
"%%shell\n",
"wget -O - https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/colab.sh | bash\n",
"#This should take around one minute to finish. Once it does, change the runtime to `Julia` by choosing `Runtime` \n",
"# from the toolbar then `Change runtime type`. You can then delete this cell."
]
},
{
"cell_type": "markdown",
"id": "50fbf3a7",
@@ -22,6 +37,10 @@
},
"outputs": [],
"source": [
"import Pkg;\n",
"Pkg.add([\"Random\", \"CSV\", \"DataFrames\", \"MLJ\", \"Imbalance\", \"MLJBalancing\", \n",
" \"ScientificTypes\",\"Impute\", \"StatsBase\", \"Plots\", \"Measures\", \"HTTP\"])\n",
"\n",
"using Random\n",
"using CSV\n",
"using DataFrames\n",
@@ -31,7 +50,8 @@
"using StatsBase\n",
"using ScientificTypes\n",
"using Plots, Measures\n",
"using Impute"
"using Impute\n",
"using HTTP: download"
]
},
{
@@ -40,7 +60,7 @@
"metadata": {},
"source": [
"## Loading Data\n",
"In this example, we will consider the [Cerebral Stroke Prediction Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset) found on Kaggle for the objective of predicting where a stroke has occured given medical features about patients.\n",
"In this example, we will consider the [Cerebral Stroke Prediction Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset) found on Kaggle for the objective of predicting where a stroke has occurred given medical features about patients.\n",
"\n",
"`CSV` gives us the ability to easily read the dataset after it's downloaded as follows"
]
@@ -77,10 +97,11 @@
}
],
"source": [
"df = CSV.read(\"../datasets/cerebral.csv\", DataFrame)\n",
"download(\"https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/cerebral_ensemble/cerebral.csv\", \"./\")\n",
"df = CSV.read(\"./cerebral.csv\", DataFrame)\n",
"\n",
"# Display the first 5 rows with DataFrames\n",
"first(df, 5) |> pretty\n"
"first(df, 5) |> pretty"
]
},
{
@@ -931,7 +952,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"[NbConvertApp] Writing 25642 bytes to cerebral_ensemble.md\n"
"[NbConvertApp] Writing 26304 bytes to cerebral_ensemble.md\n"
]
}
],
