README changes
PGS62 committed Mar 19, 2024
1 parent df8005a commit c5226de
Showing 3 changed files with 159 additions and 22 deletions.
175 changes: 155 additions & 20 deletions README.md
@@ -4,37 +4,108 @@

# KendallTau.jl

This unregistered package exports functions `corkendall` and `corkendall_fromfile` for the calculation of Kendall's τ coefficient. See [Tau-b](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) on Wikipedia. The [StatsBase](https://github.com/JuliaStats/StatsBase.jl) package has a function of the same name that was contributed from this package on 8 February 2021 (issue [634](https://github.com/JuliaStats/StatsBase.jl/issues/634), commit [647](https://github.com/JuliaStats/StatsBase.jl/commit/11ac5b596405367b3217d3d962e22523fef9bb0d)).

Since then, `KendallTau.corkendall` has improved in two ways:
This unregistered package exports four functions, which will be proposed as candidates to replace the functions of the same name in StatsBase:

- The function is now multi-threaded. On a PC with 12 cores, it's about 14 times faster than the current StatsBase version.
- There is now a `skipmissing` keyword argument to control the treatment of missing values, along the lines of the `skipmissing` argument to `StatsBase.pairwise`.
* `corkendall`, for the calculation of Kendall's τ coefficient.
* `corspearman`, for the calculation of Spearman correlation.
* `pairwise` and `pairwise!`, which apply a function `f` to all possible pairs of entries in iterators `x` and `y`.

There is an open [issue](https://github.com/JuliaStats/StatsBase.jl/issues/849) in StatsBase to bring these two improvements to `StatsBase.corkendall`, after which time this package will be largely redundant.

### Help
```
help?> KendallTau.corkendall
corkendall(x, y=x; skipmissing::Symbol=:none)
Compute Kendall's rank correlation coefficient, τ. x and y must be either vectors or matrices, and
entries may be missing.
Compute Kendall's rank correlation coefficient, τ. x and y must be either vectors or matrices, and entries may be missing.
Uses multiple threads when either x or y is a matrix.
Keyword argument
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
• skipmissing::Symbol=:none: If :none (the default), missing entries in x or y give rise to
missing entries in the return. If :pairwise when calculating an element of the return, both
ith entries of the input vectors are skipped if either is missing. If :listwise the ith rows
of both x and y are skipped if missing appears in either; note that this might skip a high
proportion of entries. Only allowed when x or y is a matrix.
• skipmissing::Symbol=:none: If :none (the default), missing entries in x or y give rise to missing entries in the return. If :pairwise when calculating an
element of the return, both ith entries of the input vectors are skipped if either is missing. If :listwise the ith rows of both x and y are skipped if
missing appears in either; note that this might skip a high proportion of entries. Only allowed when x or y is a matrix.
```
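
To make the `skipmissing=:pairwise` rule in the help text concrete, here is a minimal sketch in plain Julia. It is a naive O(n²) tau-b, not the package's multi-threaded algorithm, and the names `naive_tau_b` and `tau_b_skipping` are hypothetical, introduced only for illustration.

```julia
# Naive O(n^2) Kendall tau-b, for illustration only (not the package's algorithm).
function naive_tau_b(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    n = length(x)
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in 1:n-1, j in i+1:n
        dx = sign(x[j] - x[i])
        dy = sign(y[j] - y[i])
        if dx == 0 && dy == 0
            # tied in both vectors: contributes to neither factor of the denominator
        elseif dx == 0
            tied_x_only += 1
        elseif dy == 0
            tied_y_only += 1
        elseif dx == dy
            concordant += 1
        else
            discordant += 1
        end
    end
    untied = concordant + discordant
    # tau-b: (C - D) / sqrt((pairs not tied in x) * (pairs not tied in y))
    return (concordant - discordant) / sqrt((untied + tied_y_only) * (untied + tied_x_only))
end

# Hypothetical wrapper mimicking the documented :none and :pairwise behaviour for two vectors.
function tau_b_skipping(x::AbstractVector, y::AbstractVector; skipmissing::Symbol=:none)
    if skipmissing == :none
        # Any missing entry propagates to the result.
        (any(ismissing, x) || any(ismissing, y)) && return missing
        return naive_tau_b(x, y)
    elseif skipmissing == :pairwise
        # Drop the ith entries of both vectors if either one is missing.
        keep = .!(ismissing.(x) .| ismissing.(y))
        return naive_tau_b(x[keep], y[keep])
    else
        throw(ArgumentError("this sketch only handles :none and :pairwise"))
    end
end
```

With `x = [1, 2, missing, 4]` and `y = [1, 3, 2, 4]`, the sketch returns `missing` under the default and `1.0` under `skipmissing=:pairwise`, because the third entry of both vectors is dropped before the pairs are counted.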

## Performance
Note the reduction in number and size of allocations. This was key to obtaining the full benefit of multi-threading.
```
corspearman(x, y=x; skipmissing::Symbol=:none)
Compute Spearman's rank correlation coefficient. If x and y are vectors, the output is a float, otherwise it's a matrix corresponding to the pairwise correlations of
the columns of x and y.
Uses multiple threads when either x or y is a matrix and skipmissing is :pairwise.
Keyword argument
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
• skipmissing::Symbol=:none: If :none (the default), missing entries in x or y give rise to missing entries in the return. If :pairwise when calculating an
element of the return, both ith entries of the input vectors are skipped if either is missing. If :listwise the ith rows of both x and y are skipped if
missing appears in either; note that this might skip a high proportion of entries. Only allowed when x or y is a matrix.
```
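
As a rough illustration of what Spearman's coefficient computes, the sketch below takes the Pearson correlation of average ("tied") ranks. It is an assumption-laden, single-threaded sketch using only the `Statistics` standard library; `average_ranks` and `naive_spearman` are hypothetical names and are not how the package is implemented.

```julia
using Statistics

# Average ranks: equal values share the mean of the positions they occupy.
function average_ranks(v::AbstractVector)
    n = length(v)
    order = sortperm(v)
    ranks = zeros(Float64, n)
    i = 1
    while i <= n
        j = i
        # Extend j to cover the run of values equal to v[order[i]].
        while j < n && v[order[j + 1]] == v[order[i]]
            j += 1
        end
        avg = (i + j) / 2
        for k in i:j
            ranks[order[k]] = avg
        end
        i = j + 1
    end
    return ranks
end

# Spearman's coefficient as the Pearson correlation of the ranks.
naive_spearman(x::AbstractVector, y::AbstractVector) = cor(average_ranks(x), average_ranks(y))
```

For example, `average_ranks([10, 20, 20, 30])` gives `[1.0, 2.5, 2.5, 4.0]`: the two tied values share ranks 2 and 3.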

```
pairwise(f, x[, y];
symmetric::Bool=false, skipmissing::Symbol=:none)
Return a matrix holding the result of applying f to all possible pairs of entries in iterators x and y. Rows correspond to entries in x and columns to entries in y.
If y is omitted then a square matrix crossing x with itself is returned.
As a special case, if f is cor, corspearman or corkendall, diagonal cells for which entries from x and y are identical (according to ===) are set to one even in the
presence of missing, NaN or Inf entries.
Keyword arguments
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
• symmetric::Bool=false: If true, f is only called to compute for the lower triangle of the matrix, and these values are copied to fill the upper triangle.
Only allowed when y is omitted and ignored (taken as true) if f is cov, cor, corkendall or corspearman.
• skipmissing::Symbol=:none: If :none (the default), missing values in inputs are passed to f without any modification. Use :pairwise to skip entries with a
missing value in either of the two vectors passed to f for a given pair of vectors in x and y. Use :listwise to skip entries with a missing value in any of
the vectors in x or y; note that this might drop a large part of entries. Only allowed when entries in x and y are vectors.
Examples
≡≡≡≡≡≡≡≡
julia> using KendallTau, Statistics
julia> x = [1 3 7
2 5 6
3 8 4
4 6 2];
julia> pairwise(cor, eachcol(x))
3×3 Matrix{Float64}:
1.0 0.744208 -0.989778
0.744208 1.0 -0.68605
-0.989778 -0.68605 1.0
julia> y = [1 3 missing
2 5 6
3 missing 2
4 6 2];
julia> pairwise(cor, eachcol(y), skipmissing=:pairwise)
3×3 Matrix{Float64}:
1.0 0.928571 -0.866025
0.928571 1.0 -1.0
-0.866025 -1.0 1.0
```
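
The row/column layout and the `:pairwise` rule described in the help text can be sketched in a few lines of plain Julia. This is a simplified, single-threaded illustration under stated assumptions: it ignores the `symmetric` keyword, the `:listwise` option and the diagonal special case, and `naive_pairwise` and `apply_pair` are hypothetical names rather than package functions.

```julia
using Statistics

# Apply f to one pair of vectors, optionally dropping positions where either is missing.
function apply_pair(f, a, b, skipmissing::Symbol)
    if skipmissing == :none
        return f(a, b)
    elseif skipmissing == :pairwise
        keep = .!(ismissing.(a) .| ismissing.(b))
        # identity.(...) narrows the element type once the missings are gone.
        return f(identity.(a[keep]), identity.(b[keep]))
    else
        throw(ArgumentError("this sketch only handles :none and :pairwise"))
    end
end

# Rows of the result correspond to entries of x, columns to entries of y.
function naive_pairwise(f, x, y=x; skipmissing::Symbol=:none)
    xs, ys = collect(x), collect(y)
    return [apply_pair(f, a, b, skipmissing) for a in xs, b in ys]
end
```

Calling `naive_pairwise(cor, eachcol(x))` and `naive_pairwise(cor, eachcol(y), skipmissing=:pairwise)` with the matrices from the Examples section should reproduce the matrices shown there to the displayed precision.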
<!--
This unregistered package exports functions `corkendall` and `corkendall_fromfile` for the calculation of Kendall's τ coefficient. See [Tau-b](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) on Wikipedia. The [StatsBase](https://github.com/JuliaStats/StatsBase.jl) package has a function of the same name that was contributed from this package on 8 February 2021 (issue [634](https://github.com/JuliaStats/StatsBase.jl/issues/634), commit [647](https://github.com/JuliaStats/StatsBase.jl/commit/11ac5b596405367b3217d3d962e22523fef9bb0d)).
Since then, `KendallTau.corkendall` has improved in two ways:
- The function is now multi-threaded. On a PC with 12 cores, it's about 14 times faster than the current StatsBase version.
- There is now a `skipmissing` keyword argument to control the treatment of missing values, along the lines of the `skipmissing` argument to `StatsBase.pairwise`.
There is an open [issue](https://github.com/JuliaStats/StatsBase.jl/issues/849) in StatsBase to bring these two improvements to `StatsBase.corkendall`, after which time this package will be largely redundant.
-->

## `corkendall` performance
```julia
julia> using StatsBase, KendallTau, Random, BenchmarkTools #StatsBase v0.34.2

@@ -71,9 +142,73 @@ Platform Info:
LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
Threads: 29 on 20 virtual cores
```
<!--
TODO Update using work 12-core PC
-->
## `corspearman` performance
```
julia> using StatsBase, KendallTau, Random, BenchmarkTools #StatsBase v0.34.2
julia> x = rand(1000,10);StatsBase.corspearman(x)==KendallTau.corspearman(x)#compile
true
julia> x = rand(1000,1000);
julia> res_sb = @btime StatsBase.corspearman(x);
29.494 s (3503503 allocations: 11.44 GiB)
julia> res_kt = @btime KendallTau.corspearman(x);
46.774 ms (1127 allocations: 39.31 MiB)
julia> res_kt == res_sb
true
julia> 29.494/.046774
630.5639885406422
```


## `pairwise` performance
```
julia> using StatsBase, KendallTau, Random, BenchmarkTools, LinearAlgebra #StatsBase v0.34.2
julia> x = rand(1000,10); xm = ifelse.(x .< .05, missing, x);
julia> KendallTau.pairwise(LinearAlgebra.dot,eachcol(xm),skipmissing=:pairwise)≈StatsBase.pairwise(LinearAlgebra.dot,eachcol(xm),skipmissing=:pairwise)#compile
true
julia> x = rand(1000,1000); xm = ifelse.(x .< .05, missing, x);
julia> res_kt = @btime KendallTau.pairwise(LinearAlgebra.dot,eachcol(xm),skipmissing=:pairwise);
617.629 ms (3000153 allocations: 114.59 MiB)
julia> res_sb = @btime StatsBase.pairwise(LinearAlgebra.dot,eachcol(xm),skipmissing=:pairwise);
8.378 s (4999007 allocations: 17.95 GiB)
julia> res_kt≈res_sb
true
julia> 8.378/0.617629
13.564777560639154
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, sandybridge)
Threads: 11 on 8 virtual cores
```

### Performance against size of `x`
### `corkendall` performance against size of `x`
<img width="800" alt="image" src="plots/KendallTau vs StatsBase corkendall speed on 12 core 20 thread 15 Feb 2024.svg">

Philip Swannell
15 February 2024
19 March 2024
2 changes: 2 additions & 0 deletions src/_notes.jl
@@ -102,5 +102,7 @@ true
julia> maximum(abs.(res_sb.-res_kt))
1.0800249583553523e-12
julia> 115.311/22.138
5.208736109856355
=#
4 changes: 2 additions & 2 deletions src/pairwise.jl
@@ -222,7 +222,7 @@ presence `missing`, `NaN` or `Inf` entries.
# Examples
```jldoctest
julia> using StatsBase, Statistics
julia> using KendallTau, Statistics
julia> dest = zeros(3, 3);
@@ -287,7 +287,7 @@ presence `missing`, `NaN` or `Inf` entries.
# Examples
```jldoctest
julia> using StatsBase, Statistics
julia> using KendallTau, Statistics
julia> x = [1 3 7
2 5 6