Skip to content

Commit

Permalink
Work on Equal_sum_vectors
Browse files Browse the repository at this point in the history
  • Loading branch information
PGS62 committed Apr 3, 2024
1 parent a66f137 commit 850a678
Show file tree
Hide file tree
Showing 5 changed files with 52 additions and 81 deletions.
20 changes: 7 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,10 @@ StatsBase versions with the versions from this package, as a follow-on from issu

<details><summary><u>Click for function documentation</u></summary>
<p>

```
corkendall(x, y=x; skipmissing::Symbol=:none)
Compute Kendall's rank correlation coefficient, τ. x and y must be either vectors or matrices, and entries may be missing.
Uses multiple threads when either x or y is a matrix.
Expand All @@ -41,8 +40,7 @@ StatsBase versions with the versions from this package, as a follow-on from issu
```
corspearman(x, y=x; skipmissing::Symbol=:none)
Compute Spearman's rank correlation coefficient. If x and y are vectors, the output is a float, otherwise it's a matrix corresponding to the pairwise correlations of
Compute Spearman's rank correlation coefficient. If x and y are vectors, the output is a float, otherwise it's a matrix corresponding to the pairwise correlations of
the columns of x and y.
Uses multiple threads when either x or y is a matrix and skipmissing is :pairwise.
Expand All @@ -59,11 +57,10 @@ StatsBase versions with the versions from this package, as a follow-on from issu
pairwise(f, x[, y];
symmetric::Bool=false, skipmissing::Symbol=:none)
Return a matrix holding the result of applying f to all possible pairs of entries in iterators x and y. Rows correspond to entries in x and columns to entries in y.
Return a matrix holding the result of applying f to all possible pairs of entries in iterators x and y. Rows correspond to entries in x and columns to entries in y.
If y is omitted then a square matrix crossing x with itself is returned.
As a special case, if f is cor, corspearman or corkendall, diagonal cells for which entries from x and y are identical (according to ===) are set to one even in the
As a special case, if f is cor, corspearman or corkendall, diagonal cells for which entries from x and y are identical (according to ===) are set to one even in the
presence missing, NaN or Inf entries.
Keyword arguments
Expand All @@ -72,8 +69,8 @@ StatsBase versions with the versions from this package, as a follow-on from issu
• symmetric::Bool=false: If true, f is only called to compute for the lower triangle of the matrix, and these values are copied to fill the upper triangle.
Only allowed when y is omitted and ignored (taken as true) if f is cov, cor, corkendall or corspearman.
• skipmissing::Symbol=:none: If :none (the default), missing values in inputs are passed to f without any modification. Use :pairwise to skip entries with a
missing value in either of the two vectors passed to f for a given pair of vectors in x and y. Use :listwise to skip entries with a missing value in any of
• skipmissing::Symbol=:none: If :none (the default), missing values in inputs are passed to f without any modification. Use :pairwise to skip entries with a
missing value in either of the two vectors passed to f for a given pair of vectors in x and y. Use :listwise to skip entries with a missing value in any of
the vectors in x or y; note that this might drop a large part of entries. Only allowed when entries in x and y are vectors.
Examples
Expand Down Expand Up @@ -223,8 +220,5 @@ julia> 3.848/0.121942
### `corspearman` performance against size of `x`
<img width="800" alt="image" src="plots/KendallTau vs StatsBase corspearman speed on 12 core 20 thread 21 March 2024.svg">


Philip Swannell
21 March 2024


21 March 2024
1 change: 0 additions & 1 deletion src/_notes.jl
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,6 @@ I think the call stack described above is one layer too deep, thanks to new fn _
Speedup with fast kernel function (LinearAlgebra.dot)
julia> using StatsBase,KendallTau
julia> x = rand(1000,1000);
Expand Down
87 changes: 29 additions & 58 deletions src/pairwise.jl
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y,
#cov(x) is faster than cov(x, x)
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y)))

Threads.@threads for subset in Equal_sum_subsets(nc, Threads.nthreads())
Threads.@threads for subset in Equal_sum_vectors(nc, Threads.nthreads())
for j in subset
for i = (symmetric ? j : 1):nr
# For performance, diagonal is special-cased
Expand Down Expand Up @@ -96,7 +96,7 @@ function _pairwise!(::Val{:pairwise}, f, dest::AbstractMatrix{V}, x, y, symmetri
nmtx = promoted_nmtype(x)[]
nmty = promoted_nmtype(y)[]

Threads.@threads for subset in Equal_sum_subsets(nc, Threads.nthreads())
Threads.@threads for subset in Equal_sum_vectors(nc, Threads.nthreads())
scratch_fx = task_local_vector(:scratch_fx, nmtx, m)
scratch_fy = task_local_vector(:scratch_fy, nmty, m)
for j in subset
Expand Down Expand Up @@ -393,44 +393,6 @@ function handle_pairwise(x::AbstractVector, y::AbstractVector;
return view(scratch_fx, lb:(j-1)), view(scratch_fy, lb:(j-1))
end

#TODO remove tsts for equal_sum_subsets, add tests for Equal_sum_subsets, rename to Equal_sum_vectors?

#=Condition a) makes equal_sum_subsets useful for load-balancing the multi-threaded section
of _pairwise! in the non-symmetric case, and condition b) for the symmetric case.=#
"""
equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}}
Divide the integers 1:n into a number of subsets such that a) each subset has
(approximately) the same number of elements; and b) the sum of the elements in each subset
is nearly equal. If `n` is a multiple of `2 * num_subsets` both conditions hold exactly.
## Example
```julia-repl
julia> StatsBase.equal_sum_subsets(30,5)
5-element Vector{Vector{Int64}}:
[30, 21, 20, 11, 10, 1]
[29, 22, 19, 12, 9, 2]
[28, 23, 18, 13, 8, 3]
[27, 24, 17, 14, 7, 4]
[26, 25, 16, 15, 6, 5]
```
"""
function equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}}
subsets = [Int[] for _ in 1:min(n, num_subsets)]
writeto, scanup = 1, true
for i = n:-1:1
push!(subsets[writeto], i)
if scanup && writeto == num_subsets
scanup = false
elseif (!scanup) && writeto == 1
scanup = true
else
writeto += scanup ? 1 : -1
end
end
return subsets
end

"""
task_local_vector(key::Symbol, similarto::AbstractArray{V},
length::Int)::Vector{V} where {V}
Expand All @@ -444,19 +406,16 @@ function task_local_vector(key::Symbol, similarto::AbstractArray{V},
return task_local_storage(key)
end

#Alternative approach - use an iterator.

"""
Equal_sum_subsets
Equal_sum_vectors
An iterator enabling the partition of the integers 1 to n into `num_subsets` vectors such
that a) each subset has (approximately) the same number of elements; and b) the sum of the
elements in each subset is nearly equal. If `n` is a multiple of `2 * num_subsets` both
conditions hold exactly.
An iterator that partitions the integers 1 to n into `num_vectors` vectors such that a) each
vector is of nearly equal length; and b) the sum of the elements in each vector is nearly
equal. If `n` is a multiple of `2 * num_vectors` both conditions hold exactly.
## Example
```julia-repl
julia> for s in KendallTau.Equal_sum_subsets(30,5)
julia> for s in KendallTau.Equal_sum_vectors(30,5)
println((s, sum(s)))
end
([30, 21, 20, 11, 10, 1], 93)
Expand All @@ -466,32 +425,44 @@ end
([26, 25, 16, 15, 6, 5], 93)
```
"""
struct Equal_sum_subsets
struct Equal_sum_vectors
n::Int64
num_subsets::Int64
num_vectors::Int64
end

Base.length(x::Equal_sum_subsets) = min(x.n, x.num_subsets)
Base.length(x::Equal_sum_vectors) = min(x.n, x.num_vectors)

Base.firstindex(x::Equal_sum_subsets) = 1
Base.firstindex(x::Equal_sum_vectors) = 1

function Base.iterate(ess::Equal_sum_subsets, state::Int64=1)
function Base.iterate(ess::Equal_sum_vectors, state::Int64=1)
state > length(ess) && return (nothing)
return (getindex(ess, state), state + 1)
end

function demo(n, num_subsets)
for s in Equal_sum_subsets(n, num_subsets)
function demo(n, num_vectors)
for s in Equal_sum_vectors(n, num_vectors)
println(s, sum(s))
end
end

function Base.getindex(ess::Equal_sum_subsets, i::Int64=1)
function Base.getindex(ess::Equal_sum_vectors, i::Int64=1)
i > length(ess) && return (nothing)
n, nss = ess.n, ess.num_subsets
n, nss = ess.n, ess.num_vectors
s = 2i - 1
b = 2nss - s
result = zeros(Int64, div(n, nss) + ((i <= rem(n, nss)) ? 1 : 0))

lresult = div(n, nss)
if mod(lresult, 2) == 1
if i > nss - rem(n, nss)
lresult += 1
end
else
if i <= rem(n, nss)
lresult += 1
end
end

result = zeros(Int64, lresult)
result[1] = n - i + 1
k = 1
while true
Expand Down
4 changes: 2 additions & 2 deletions src/rankcorr.jl
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ function _pairwise!(::Val{:pairwise}, f::typeof(corspearman),
fl64 = Float64[]
nmtx = promoted_nmtype(x)[]
nmty = promoted_nmtype(y)[]
Threads.@threads for subset in Equal_sum_subsets(nr, Threads.nthreads())
Threads.@threads for subset in Equal_sum_vectors(nr, Threads.nthreads())

for i in subset

Expand Down Expand Up @@ -482,7 +482,7 @@ function corkendall_loop!(skipmissing::Symbol, f::typeof(corkendall), dest::Abst

symmetric = x === y

Threads.@threads for subset in Equal_sum_subsets(nr, Threads.nthreads())
Threads.@threads for subset in Equal_sum_vectors(nr, Threads.nthreads())

for i in subset

Expand Down
21 changes: 14 additions & 7 deletions test/rankcorr.jl
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,16 @@ using Test
@test KendallTau.midpoint(1, 10) == 5
@test KendallTau.midpoint(1, widen(10)) == 5

@test KendallTau.equal_sum_subsets(0, 1) == Vector{Int64}[]
@test sum.(KendallTau.equal_sum_subsets(100, 5)) == repeat([1010], 5)
@test sort(vcat(KendallTau.equal_sum_subsets(500, 7)...)) == collect(1:500)
for n in 1:50, nss in 1:5
#check is a partition
@test sort(vcat([s for s in KendallTau.Equal_sum_vectors(n, nss)]...)) == 1:n
#check near-equal lengths
lengths = [length(s) for s in KendallTau.Equal_sum_vectors(n, nss)]
@test (maximum(lengths) - minimum(lengths)) <= 1
#check near-equal sums
sums = [sum(s) for s in KendallTau.Equal_sum_vectors(n, nss)]
@test (maximum(sums) - minimum(sums)) < nss
end

end

Expand Down Expand Up @@ -362,11 +369,11 @@ end
factor" of 1.2 against the expected size of allocations.
=#
@test (@allocated corkendall(x)) < (897_808 + Threads.nthreads() * 58_044) * 1.2
@test (@allocated corkendall(xm,skipmissing=:listwise)) < (1_119_392 + Threads.nthreads() * 22_172) * 1.2
@test (@allocated corkendall(xm,skipmissing=:pairwise)) < (892_112 + Threads.nthreads() * 61_116) * 1.2
@test (@allocated corkendall(xm, skipmissing=:listwise)) < (1_119_392 + Threads.nthreads() * 22_172) * 1.2
@test (@allocated corkendall(xm, skipmissing=:pairwise)) < (892_112 + Threads.nthreads() * 61_116) * 1.2
@test (@allocated corspearman(x)) < (2_678_448 + Threads.nthreads() * 9_128) * 1.2
@test (@allocated corspearman(xm,skipmissing=:listwise)) < (1_803_712 + Threads.nthreads() * 3_992) * 1.2
@test (@allocated corspearman(xm,skipmissing=:pairwise)) < (1_692_208 + Threads.nthreads() * 67_172) * 1.2
@test (@allocated corspearman(xm, skipmissing=:listwise)) < (1_803_712 + Threads.nthreads() * 3_992) * 1.2
@test (@allocated corspearman(xm, skipmissing=:pairwise)) < (1_692_208 + Threads.nthreads() * 67_172) * 1.2

end
# COV_EXCL_STOP

0 comments on commit 850a678

Please sign in to comment.