Commit

Merge pull request #69 from JuliaAI/dev
For a 0.1.2 release
EssamWisam authored Oct 11, 2023
2 parents bff9803 + 4699250 commit 7199d68
Showing 120 changed files with 6,801 additions and 15,524 deletions.
5 changes: 5 additions & 0 deletions .github/codecov.yml
@@ -0,0 +1,5 @@
coverage:
status:
project:
default:
threshold: 0.5%
4 changes: 2 additions & 2 deletions .github/workflows/CI.yml
@@ -1,11 +1,11 @@
name: CI
on: [push, pull_request]
on: [push]
jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
matrix:
julia-version: ['1.8']
julia-version: ['1.6', '1']
julia-arch: [x64]
os: [ubuntu-latest, windows-latest, macOS-latest]
steps:
4 changes: 2 additions & 2 deletions Project.toml
@@ -41,11 +41,11 @@ Tables = "1.10"
TransformsBase = "1.2"
julia = "1.6"


[extras]
Conda = "8f4d0f93-b110-5947-807f-2305c1781a2d"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
IOCapture = "b5f81e59-6552-4d32-b1f0-c071b021bf89"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
@@ -54,4 +54,4 @@ TableTransforms = "0d432bfd-3ee1-4ac1-886a-39f05cc69a3e"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "DataFrames", "MLJBase", "TableTransforms", "StableRNGs", "PyCall", "Pkg", "Conda", "IOCapture"]
test = ["Test", "DataFrames", "MLJBase", "TableTransforms", "StableRNGs", "PyCall", "Pkg", "Conda", "IOCapture", "JLD2"]
5 changes: 3 additions & 2 deletions README.md
@@ -75,6 +75,7 @@ Xover, yover = transform(mach, X, y)
```
All implemented oversampling methods are static transforms and hence no `fit` is required.
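As a concrete illustration, a minimal sketch of the static-transform workflow (the toy-data helper's keyword names and the hyperparameter values here are assumptions, not taken from the snippet above):

```julia
using MLJ, Imbalance

# toy imbalanced data; `generate_imbalanced_data` ships with Imbalance.jl
# (keyword names recalled from its docs and may differ across versions)
X, y = Imbalance.generate_imbalanced_data(100, 2; class_probs=[0.9, 0.1], rng=42)

SMOTE = @load SMOTE pkg=Imbalance
oversampler = SMOTE(k=5, ratios=1.0, rng=42)

# static transform: the machine is constructed without data and never fit
mach = machine(oversampler)
Xover, yover = transform(mach, X, y)
```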

#### Pipelining Models
If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to form a unified model in which resampling is automatically applied to the data before training (and bypassed during prediction).

```julia
@@ -141,9 +142,9 @@ $$\hat{\theta} = \arg\min_{\theta} \left( \frac{1}{N_1} \sum_{i \in C_1} L(f_{\t

Class imbalance occurs when some classes have far fewer examples than others. The terms corresponding to smaller classes then contribute minimally to the sum, which makes it possible for a learning algorithm to find an approximate minimizer of the empirical risk that mostly minimizes only the dominant sums. This yields a hypothesis $f_\theta$ that may differ greatly from the true target $f$ on the minority classes, which may be the most important ones for the application in question.

One obvious possible remedy is to weight the smaller sums so that a learning algorithm more easily avoids approximate solutions that exploit their insignificance which can be seen to be equivalent to repeating examples of the observations in minority classes. This can be achieved by naive random oversampling which is offered by this package along with other more advanced oversampling methods that function by generating synthetic data, which ideally would be analogous to one of the most plausible solutions to the class imbalance problem: collecting more data.
One obvious remedy is to weight the smaller sums so that a learning algorithm cannot as easily settle for approximate solutions that exploit their insignificance; this weighting can be shown to be equivalent to repeating the observations in the minority classes. Such repetition is exactly what naive random oversampling does, and this package offers it along with more advanced methods that generate synthetic data or delete existing data. You can read more about the class imbalance problem and the algorithms implemented in this package in [this](https://medium.com/@essamwissam/class-imbalance-and-oversampling-a-formal-introduction-c77b918e586d) series of articles on Medium.
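Concretely, writing $N_c = |C_c|$ for the size of class $c$, the reweighting remedy replaces the plain empirical risk with a class-weighted one (a sketch in the notation above; the choice $w_c \propto 1/N_c$ is one common convention, not the only one):

$$\hat{R}_w(\theta) = \frac{1}{N} \sum_{c=1}^{K} w_c \sum_{i \in C_c} L(f_{\theta}(x_i), y_i), \qquad w_c \propto \frac{1}{N_c},$$

so that each class contributes comparably to the objective; for integer weights this is precisely what repeating minority observations achieves.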

To our knowledge, there are no existing maintained Julia packages that implement resampling algorithms for multi-class classification problems or that handle both nominal and continuous features. This has served as a primary motivation for the creation of this package.

## 👥 Credits
This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
3 changes: 1 addition & 2 deletions docs/Project.toml
@@ -4,6 +4,7 @@ CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
Imbalance = "c709b415-507b-45b7-9a3d-1767c89fde68"
Impute = "f7bf1975-0170-51b9-8c5f-a992d46b9575"
LIBSVM = "b1bec4e5-fd48-53fe-b0cb-9723c09d164b"
@@ -14,7 +15,6 @@ MLJFlux = "094fc8d1-fd35-5302-93ea-dabda2abf845"
MLJLIBSVMInterface = "61c7150f-6c77-4bb1-949c-13197eac2a52"
MLJNaiveBayesInterface = "33e4bacb-b9e2-458e-9a13-5d9a90b235fa"
MLJScikitLearnInterface = "5ae90465-5518-4432-b9d2-8a1def2f0cab"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
Measures = "442fdcdd-2543-5da2-b0f3-8c86c306513e"
NaiveBayes = "9bbee03b-0db5-5f46-924f-b5c9c21b8c60"
OneRule = "90484964-6d6a-4979-af09-8657dbed84ff"
@@ -32,7 +32,6 @@ DocumenterTools = "0.1"
Imbalance = "0.1"
MLJ = "0.19"
MLJBase = "0.21"
MLUtils = "0.4"
Plots = "1.39"
ScientificTypes = "3.0"
TableTransforms = "1.10"
3 changes: 2 additions & 1 deletion docs/example-gen.jl
@@ -78,7 +78,8 @@ for item in data
<a href="$colab_link"><img id="colab" src="./assets/colab.png"/></a>
<a href="$link">
<img src="$img_src" alt="Image">
<div class="item-title">$title
<div class="item-title">
<b>$title</b>
<p>$description</p>
</div>
</a>
7 changes: 5 additions & 2 deletions docs/make.jl
@@ -27,15 +27,18 @@ makedocs(sitename = "Imbalance.jl",
"Oversampling"=>"algorithms/oversampling_algorithms.md",
"Undersampling"=>"algorithms/undersampling_algorithms.md",
"Combination"=>"algorithms/mlj_balancing.md",
"Implementation Notes"=>"algorithms/implementation_notes.md",
"Extras"=>"algorithms/extra_algorithms.md",

],
"Walkthrough" => Any[
"Tutorial" => Any[
"Introduction"=>"examples/walkthrough.md",
"More Examples"=>"examples.md",
"Google Colab"=>"examples/Colab.md"
],
"Contributing" => "contributing.md",
"About" => "about.md"],
warnonly = true,
warnonly=true
)


8 changes: 8 additions & 0 deletions docs/src/algorithms/implementation_notes.md
@@ -0,0 +1,8 @@

### Generalizing to Multiclass
Papers often propose the resampling algorithm for the case of binary classification only. In many cases, the algorithm merely expects a set of points to resample and does not depend on the existence of a majority class (e.g., it estimates the distribution of the points and then generates new samples from it), so it can be generalized by simply applying the algorithm to each class. In other cases, there is an interaction with the majority class (e.g., a point is borderline in `BorderlineSMOTE1` if most, but not all, of its neighbors are from the majority class). In this case, a one-vs-rest scheme is used as proposed in [1]. For instance, a point is then borderline if most, but not all, of its neighbors are from a different class.
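The per-class scheme can be sketched as follows, where `resample_class` is a hypothetical stand-in for any class-agnostic resampling routine (this is not the package's internal API):

```julia
# Apply a class-agnostic oversampling routine to each class separately.
# `resample_class(Xc, n)` is assumed to return `n` synthetic points
# generated from the class points `Xc` (observations as columns).
function per_class_oversample(X::AbstractMatrix, y::AbstractVector,
                              resample_class; ratios::Dict)
    majority = maximum(count(==(c), y) for c in unique(y))
    Xnew, ynew = X, copy(y)
    for c in unique(y)
        Xc = X[:, y .== c]
        n = round(Int, ratios[c] * majority) - size(Xc, 2)
        n <= 0 && continue              # class already meets its target size
        Xnew = hcat(Xnew, resample_class(Xc, n))
        append!(ynew, fill(c, n))
    end
    return Xnew, ynew
end
```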

### Generalizing to Real Ratios
Papers often propose the resampling algorithm using integer ratios only. For instance, a ratio of $2$ means doubling the amount of data in a class, while a ratio of $2.2$ is either disallowed or rounded. In `Imbalance.jl`, any appropriate real ratio can be used, and the ratio is relative to the size of the majority or minority class depending on whether the algorithm oversamples or undersamples. The generalization works by randomly choosing points instead of looping over each point. That is, if a $2.2$ ratio corresponds to $227$ examples, then $227$ points are chosen randomly with replacement and the resampling logic is applied to each. Given an integer ratio $k$, this is on average equivalent to looping over the points $k$ times.
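In code, the real-ratio generalization amounts to simple arithmetic followed by a random choice with replacement (the class counts below are hypothetical, picked so the arithmetic reproduces the $227$ in the text):

```julia
using Random, StatsBase

# number of synthetic points for a class, given a real ratio
# measured relative to the majority class size
n_to_generate(class_count, majority_count, ratio) =
    round(Int, ratio * majority_count) - class_count

n = n_to_generate(103, 150, 2.2)   # round(2.2 * 150) - 103 = 227

# choose a source point for each synthetic point, with replacement
idx = sample(MersenneTwister(42), 1:103, n; replace = true)
# ... the per-point resampling logic is then applied to each chosen point
```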

[1] López, V., Fernández, A., Moreno-Torres, J.G., & Herrera, F. (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7), 6585-6608.
16 changes: 8 additions & 8 deletions docs/src/algorithms/oversampling_algorithms.md
@@ -2,15 +2,15 @@

The following table lists the supported oversampling algorithms, whether each mechanism repeats or generates data, and the supported data types.

| Oversampling Method | Mechanism | Supported Data Type |
| Oversampling Method | Mechanism | Supported Data Types |
|:----------:|:----------:|:----------:|
| Random Oversampler | Repeat existing data | Continuous and/or nominal |
| Random Walk Oversampler | Generate synthetic data | Continuous and/or nominal |
| ROSE | Generate synthetic data | Continuous |
| SMOTE | Generate synthetic data | Continuous |
| BorderlineSMOTE1 | Generate synthetic data | Continuous |
| SMOTE-N | Generate synthetic data | Nominal |
| SMOTE-NC | Generate synthetic data | Continuous and nominal |
| [Random Oversampler](@ref) | Repeat existing data | Continuous and/or nominal |
| [Random Walk Oversampler](@ref) | Generate synthetic data | Continuous and/or nominal |
| [ROSE](@ref) | Generate synthetic data | Continuous |
| [SMOTE](@ref) | Generate synthetic data | Continuous |
| [Borderline SMOTE1](@ref) | Generate synthetic data | Continuous |
| [SMOTE-N](@ref) | Generate synthetic data | Nominal |
| [SMOTE-NC](@ref) | Generate synthetic data | Continuous and nominal |
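Each method is also exposed through a plain functional interface. A hedged sketch for `SMOTE` (the toy-data helper's keyword names and the hyperparameter values are illustrative assumptions):

```julia
using Imbalance

# toy data: roughly 80% majority, 20% minority
X, y = Imbalance.generate_imbalanced_data(200, 3; class_probs=[0.8, 0.2], rng=42)

# oversample smaller classes up to 80% of the majority class size,
# using 5 nearest neighbors to interpolate synthetic points
Xover, yover = smote(X, y; k=5, ratios=0.8, rng=42)
```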


## Random Oversampler
10 changes: 5 additions & 5 deletions docs/src/algorithms/undersampling_algorithms.md
@@ -2,12 +2,12 @@

The following table lists the supported undersampling algorithms, whether each mechanism deletes or generates new data, and the supported data types.

| Undersampling Method | Mechanism | Supported Data Type |
| Undersampling Method | Mechanism | Supported Data Types |
|:----------:|:----------:|:----------:|
| Random Undersampler | Delete existing data as needed | Continuous and/or nominal |
| Cluster Undersampler | Generate new data or delete existing data | Continuous |
| ENN Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
| Tomek Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
| [Random Undersampler](@ref) | Delete existing data as needed | Continuous and/or nominal |
| [Cluster Undersampler](@ref) | Generate new data or delete existing data | Continuous |
| [Edited Nearest Neighbors Undersampler](@ref) | Delete existing data meeting certain conditions (cleaning) | Continuous |
| [Tomek Links Undersampler](@ref) | Delete existing data meeting certain conditions (cleaning) | Continuous |
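The functional interface mirrors the oversampling one; a hedged sketch for random undersampling (function and keyword names are assumed from the reference links above, and the values are illustrative):

```julia
using Imbalance

# toy data: roughly 80% majority, 20% minority
X, y = Imbalance.generate_imbalanced_data(200, 3; class_probs=[0.8, 0.2], rng=42)

# shrink every larger class down to 1.0 x the minority class size
Xunder, yunder = random_undersample(X, y; ratios=1.0, rng=42)
```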



4 changes: 4 additions & 0 deletions docs/src/assets/light.scss
@@ -99,4 +99,8 @@ code.nohighlight.hljs {
right: 3px;
width: 11%;
display: none;
}

.content pre {
border-radius: 1rem !important;
}
6 changes: 3 additions & 3 deletions docs/src/contributing.md
@@ -37,10 +37,10 @@ Any method resampling method implemented in the `oversampling_methods` or `under
You can skip the third step if the algorithm you are implementing does not operate in a per-class sense.

### 🔥 Hot algorithms to add
- `K-Means SMOTE`: Takes care of where exactly to generate more points using SMOTE by factoring in "within class imbalance"
- `K-Means SMOTE`: Takes care of where exactly to generate more points using `SMOTE` by factoring in "within class imbalance". This may also be easily generalized to algorithms beyond `SMOTE`.
- `CondensedNearestNeighbors`: Undersamples the dataset so as to preserve the decision boundary found by `KNN`
- `BorderlineSMOTE2`: A small modification of the BorderlineSMOTE1 condition
- `RepeatedENNUndersampler`: Simply repeat `ENNUndersampler` multiple times
- `BorderlineSMOTE2`: A small modification of the `BorderlineSMOTE1` condition
- `RepeatedENNUndersampler`: Simply repeats `ENNUndersampler` multiple times

# Adding New Tutorials
- Make a new notebook with the tutorial in the `examples` folder found in `docs/src/examples`
24 changes: 16 additions & 8 deletions docs/src/examples.md
@@ -5,7 +5,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_ratios/effect_of_ratios.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_ratios/effect_of_ratios">
<img src="./assets/iris smote.jpeg" alt="Image">
<div class="item-title">Effect of Ratios Hyperparameter
<div class="item-title">
<b>Effect of Ratios Hyperparameter</b>
<p>In this tutorial we use an SVM and SMOTE and the Iris data to study
how the decision regions change with the amount of oversampling</p>
</div>
@@ -15,7 +16,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_s/effect_of_s.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_s/effect_of_s">
<img src="./assets/iris rose.jpeg" alt="Image">
<div class="item-title">From Random Oversampling to ROSE
<div class="item-title">
<b>From Random Oversampling to ROSE</b>
<p>In this tutorial we study the `s` parameter in ROSE and the effect
of increasing it.</p>
</div>
@@ -25,7 +27,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smote_churn_dataset/smote_churn_dataset.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smote_churn_dataset/smote_churn_dataset">
<img src="./assets/churn smote.jpeg" alt="Image">
<div class="item-title">SMOTE on Customer Churn Data
<div class="item-title">
<b>SMOTE on Customer Churn Data</b>
<p>In this tutorial we apply SMOTE and random forest to predict customer churn based
on continuous attributes.</p>
</div>
@@ -35,7 +38,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smote_mushroom/smoten_mushroom.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smoten_mushroom/smoten_mushroom">
<img src="./assets/mushy.jpeg" alt="Image">
<div class="item-title">SMOTEN on Mushroom Data
<div class="item-title">
<b>SMOTEN on Mushroom Data</b>
<p>In this tutorial we use a purely categorical dataset to predict mushroom odour.</p>
</div>
</a>
@@ -44,7 +48,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/smotenc_churn_dataset/smotenc_churn_dataset.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./smotenc_churn_dataset/smotenc_churn_dataset">
<img src="./assets/churn smoten.jpeg" alt="Image">
<div class="item-title">SMOTENC on Customer Churn Data
<div class="item-title">
<b>SMOTENC on Customer Churn Data</b>
<p>In this tutorial we extend the SMOTE tutorial to include both categorical and continuous
data for churn prediction</p>
</div>
@@ -54,7 +59,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/effect_of_k_enn/effect_of_k_enn.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./effect_of_k_enn/effect_of_k_enn">
<img src="./assets/bmi.jpeg" alt="Image">
<div class="item-title">Effect of ENN Hyperparameters
<div class="item-title">
<b>Effect of ENN Hyperparameters</b>
<p>In this tutorial we observe the effects of the hyperparameters found in ENN undersampling with an SVM model</p>
</div>
</a>
@@ -63,7 +69,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/fraud_detection/fraud_detection.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./fraud_detection/fraud_detection">
<img src="./assets/eth.jpeg" alt="Image">
<div class="item-title">SMOTE-Tomek for Ethereum Fraud Detection
<div class="item-title">
<b>SMOTE-Tomek for Ethereum Fraud Detection</b>
<p>In this tutorial we combine SMOTE with TomekUndersampler and a classification model from MLJ for fraud detection</p>
</div>
</a>
@@ -72,7 +79,8 @@
<a href="https://githubtocolab.com/JuliaAI/Imbalance.jl/blob/dev/docs/src/examples/cerebral_ensemble/cerebral_ensemble.ipynb"><img id="colab" src="./assets/colab.png"/></a>
<a href="./cerebral_ensemble/cerebral_ensemble">
<img src="./assets/brain.jpeg" alt="Image">
<div class="item-title">BalancedBagging for Cerebral Stroke Prediction
<div class="item-title">
<b>BalancedBagging for Cerebral Stroke Prediction</b>
<p>In this tutorial we use BalancedBagging from MLJBalancing with Decision Tree to predict Cerebral Strokes</p>
</div>
</a>
23 changes: 23 additions & 0 deletions docs/src/examples/Colab.md
@@ -0,0 +1,23 @@
# Google Colab

Tutorials found in the examples section or the API documentation can be run on Google Colab; how to do so should be evident upon launching the notebook. This section describes what happens under the hood.

- The first cell runs the following bash script to install Julia:

```shell
%%capture
%%shell
if ! command -v julia 3>&1 > /dev/null
then
wget -q 'https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.2-linux-x86_64.tar.gz' \
-O /tmp/julia.tar.gz
tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
rm /tmp/julia.tar.gz
fi
julia -e 'using Pkg; pkg"add IJulia; precompile;"'
echo 'Done'
```

- Once that is done, one can change the runtime to `Julia` by choosing `Runtime` from the toolbar, then `Change runtime type`; at that point the install cell can be deleted

Sincere thanks to [Julia-on-Colab](https://github.com/Dsantra92/Julia-on-Colab) for making this possible.
File renamed without changes.
File renamed without changes.
31 changes: 26 additions & 5 deletions docs/src/examples/cerebral_ensemble/cerebral_ensemble.ipynb
@@ -1,5 +1,20 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "c9d1d017",
"metadata": {},
"outputs": [],
"source": [
"# this installs Julia 1.7\n",
"%%capture\n",
"%%shell\n",
"wget -O - https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/colab.sh | bash\n",
"#This should take around one minute to finish. Once it does, change the runtime to `Julia` by choosing `Runtime` \n",
"# from the toolbar then `Change runtime type`. You can then delete this cell."
]
},
{
"cell_type": "markdown",
"id": "50fbf3a7",
@@ -22,6 +37,10 @@
},
"outputs": [],
"source": [
"import Pkg;\n",
"Pkg.add([\"Random\", \"CSV\", \"DataFrames\", \"MLJ\", \"Imbalance\", \"MLJBalancing\", \n",
" \"ScientificTypes\",\"Impute\", \"StatsBase\", \"Plots\", \"Measures\", \"HTTP\"])\n",
"\n",
"using Random\n",
"using CSV\n",
"using DataFrames\n",
@@ -31,7 +50,8 @@
"using StatsBase\n",
"using ScientificTypes\n",
"using Plots, Measures\n",
"using Impute"
"using Impute\n",
"using HTTP: download"
]
},
{
@@ -40,7 +60,7 @@
"metadata": {},
"source": [
"## Loading Data\n",
"In this example, we will consider the [Cerebral Stroke Prediction Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset) found on Kaggle for the objective of predicting where a stroke has occured given medical features about patients.\n",
"In this example, we will consider the [Cerebral Stroke Prediction Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset) found on Kaggle for the objective of predicting where a stroke has occurred given medical features about patients.\n",
"\n",
"`CSV` gives us the ability to easily read the dataset after it's downloaded as follows"
]
@@ -77,10 +97,11 @@
}
],
"source": [
"df = CSV.read(\"../datasets/cerebral.csv\", DataFrame)\n",
"download(\"https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/cerebral_ensemble/cerebral.csv\", \"./\")\n",
"df = CSV.read(\"./cerebral.csv\", DataFrame)\n",
"\n",
"# Display the first 5 rows with DataFrames\n",
"first(df, 5) |> pretty\n"
"first(df, 5) |> pretty"
]
},
{
@@ -931,7 +952,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"[NbConvertApp] Writing 25642 bytes to cerebral_ensemble.md\n"
"[NbConvertApp] Writing 26304 bytes to cerebral_ensemble.md\n"
]
}
],
