Merge pull request #75 from JuliaAI/dev
For a 0.1.3 release
EssamWisam authored Jan 6, 2024
2 parents b03a642 + 60b4461 commit f631a95
Showing 6 changed files with 24 additions and 10 deletions.
6 changes: 5 additions & 1 deletion Project.toml
```diff
@@ -1,7 +1,8 @@
 name = "Imbalance"
 uuid = "c709b415-507b-45b7-9a3d-1767c89fde68"
 authors = ["Essam Wisam <[email protected]>", "Anthony Blaom <[email protected]> and contributors"]
-version = "0.1.2"
+version = "0.1.3"
+
 
 [deps]
 CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
@@ -24,6 +25,9 @@ Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
 TransformsBase = "28dd2a49-a57a-4bfb-84ca-1a49db9b96b8"
 
 [compat]
+LinearAlgebra="1.6"
+Random="1.6"
+Statistics="1.6"
 CategoricalArrays = "0.10"
 CategoricalDistributions = "0.1"
 Clustering = "0.15"
```
4 changes: 2 additions & 2 deletions README.md
````diff
@@ -76,7 +76,7 @@ Xover, yover = transform(mach, X, y)
 All implemented oversampling methods are considered static transforms and hence, no `fit` is required.
 
 #### Pipelining Models
-If `MLJBalancing` is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).
+If [MLJBalancing](https://github.com/JuliaAI/MLJBalancing.jl) is also used, an arbitrary number of resampling methods from `Imbalance.jl` can be wrapped with a classification model from `MLJ` to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).
 
 ```julia
 using MLJBalancing
@@ -147,4 +147,4 @@ One obvious possible remedy is to weight the smaller sums so that a learning algorithm
 To our knowledge, there are no existing maintained Julia packages that implement resampling algorithms for multi-class classification problems or that handle both nominal and continuous features. This has served as a primary motivation for the creation of this package.
 
 ## 👥 Credits
-This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
+This package was created by [Essam Wisam](https://github.com/JuliaAI) as a Google Summer of Code project, under the mentorship of [Anthony Blaom](https://ablaom.github.io). Special thanks also go to [Rik Huijzer](https://github.com/rikhuijzer) for his friendliness and the binary `SMOTE` implementation in `Resample.jl`.
````
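The wrapping described in the README diff above might look roughly as follows. This is a hedged sketch, not the package's documented example: the choice of balancers and classifier, and the data `X`, `y`, are illustrative, and it assumes MLJBalancing's `BalancedModel` wrapper with keyword arguments `model`, `balancer1`, `balancer2`:

```julia
using MLJ, MLJBalancing

# illustrative choices of resamplers and classifier (assumed available in the MLJ registry)
SMOTE = @load SMOTE pkg=Imbalance
TomekUndersampler = @load TomekUndersampler pkg=Imbalance
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels

oversampler = SMOTE(k = 5, ratios = 1.0)
undersampler = TomekUndersampler(min_ratios = 0.5)

# unified model: resampling runs on the training data before the classifier
# is fit, and is bypassed at prediction time
balanced_model = BalancedModel(model = LogisticClassifier(),
                               balancer1 = oversampler,
                               balancer2 = undersampler)
mach = machine(balanced_model, X, y)
fit!(mach)
```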
4 changes: 3 additions & 1 deletion docs/src/algorithms/implementation_notes.md
```diff
@@ -5,4 +5,6 @@ Papers often propose the resampling algorithm for the case of binary classification
 ### Generalizing to Real Ratios
 Papers often propose the resampling algorithm using integer ratios. For instance, a ratio of `2` means doubling the amount of data in a class, and a ratio of $2.2$ is either not allowed or is rounded. In `Imbalance.jl`, any appropriate real ratio can be used; the ratio is relative to the size of the majority or minority class, depending on whether the algorithm oversamples or undersamples. The generalization works by randomly choosing points instead of looping over each point. That is, if a $2.2$ ratio corresponds to $227$ examples, then $227$ examples are chosen randomly with replacement and the resampling logic is applied to each. Given an integer ratio $k$, this is on average equivalent to looping over the points $k$ times.
 
-[1] López, V., Fernández, A., Moreno-Torres, J.G., & Herrera, F. (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7), 6585–6608.
+[1] Fernández, A., López, V., Galar, M., Del Jesus, M. J., and Herrera, F. (2013). Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42:97–110.
```
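The random-selection generalization described in the note above can be sketched as follows. This is an illustrative helper, not the package's internal code; plain duplication stands in for the method-specific resampling logic:

```julia
using Random

# Illustrative sketch (not Imbalance.jl internals): oversample a minority-class
# matrix X (rows are observations) so that its size becomes `ratio * majority_count`,
# where `ratio` may be any real number such as 2.2.
function oversample_by_real_ratio(X::AbstractMatrix, ratio::Real, majority_count::Int)
    n_target = round(Int, ratio * majority_count)
    n_extra = n_target - size(X, 1)
    n_extra <= 0 && return X
    # choose points randomly with replacement, then apply the resampling
    # logic to each; here that logic is plain duplication
    inds = rand(1:size(X, 1), n_extra)
    return vcat(X, X[inds, :])
end
```

For an integer ratio `k`, sampling `(k - 1) * n` indices with replacement is on average the same as visiting each of the `n` points `k - 1` extra times, which is how the real-ratio scheme falls back to the integer case.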
15 changes: 12 additions & 3 deletions docs/src/contributing.md
````diff
@@ -24,7 +24,13 @@ Any resampling method implemented in the `oversampling_methods` or `undersampling_methods`
 │ └── resample_method.jl # implements the method itself (pure functional interface)
 ```
 
-# Adding New Resampling Methods
+# Contribution
+
+
+## Reporting Problems or Seeking Support
+- Do not hesitate to post a GitHub issue with your question or problem.
+
+## Adding New Resampling Methods
 - Make a new folder `resample_method` for the method in the `oversampling_methods` or `undersampling_methods`
 - Implement in `resample_method/resample_method.jl` the method over matrices for one minority class
 - Use `generic_oversample.jl` to generalize it to work on the whole data
@@ -42,10 +48,13 @@ Of course, you can ignore the third step if the algorithm you are implementing
 - `BorderlineSMOTE2`: A small modification of the `BorderlineSMOTE1` condition
 - `RepeatedENNUndersampler`: Simply repeats `ENNUndersampler` multiple times
 
-# Adding New Tutorials
-
+## Adding New Tutorials
 - Make a new notebook with the tutorial in the `examples` folder found in `docs/src/examples`
 - Run the notebook so that the output is shown below each cell
 - If the notebook produces visuals, then save and load them in the notebook
 - Convert it to markdown by using Python to run `from convert import convert_to_md; convert_to_md('<filename>')`
 - Set a title, description, image and links for it in the dictionary found in `docs/examples.jl`
-- For the colab link, you do not need to upload anything just follow the link pattern in the file
+- For the Colab link, you do not need to upload anything; just follow the link pattern in the file
````
2 changes: 1 addition & 1 deletion docs/src/examples/Colab.md
```diff
@@ -1,6 +1,6 @@
 # Google Colab
 
-It is possible to run tutorials found in the examples section or API documentation on Google colab. It should be evident how so by launching the notebook. This section describes what happens under the hood.
+It is possible to run the tutorials found in the examples section or the API documentation on Google Colab (using the provided link or icon); launching the notebook makes this evident. This section describes what happens under the hood.
 
 - The first cell runs the following bash script to install Julia:
```
3 changes: 1 addition & 2 deletions src/common/utils.jl
```diff
@@ -35,7 +35,6 @@ where that value occurs.
 """
 function group_inds(categorical_array::AbstractVector{T}) where {T}
     result = LittleDict{T,AbstractVector{Int}}()
-    freeze(result)
     for (i, v) in enumerate(categorical_array)
         # Make a new entry in the dict if it doesn't exist
         if !haskey(result, v)
@@ -44,6 +43,6 @@ function group_inds(categorical_array::AbstractVector{T}) where {T}
         # It exists, so push the index belonging to the class
        push!(result[v], i)
     end
-    return result
+    return freeze(result)
 end
```
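To illustrate the fix, here is a self-contained sketch of the patched function together with a usage example. It assumes, as in the package, that `LittleDict` and `freeze` come from OrderedCollections.jl; the point of the patch is to build the dictionary mutably and freeze it only on return, rather than freezing before insertion:

```julia
using OrderedCollections: LittleDict, freeze

# Patched logic: group indices by value, then return an immutable (frozen) dict.
function group_inds(categorical_array::AbstractVector{T}) where {T}
    result = LittleDict{T,AbstractVector{Int}}()
    for (i, v) in enumerate(categorical_array)
        # Make a new entry in the dict if it doesn't exist
        haskey(result, v) || (result[v] = Int[])
        # Push the index belonging to the class
        push!(result[v], i)
    end
    return freeze(result)
end

group_inds(["b", "a", "b", "c"])
# maps each value to the indices where it occurs:
# "b" => [1, 3], "a" => [2], "c" => [4]
```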
