Skip to content

Commit

Permalink
Refactoring
Browse files Browse the repository at this point in the history
Tweaks to function names, doc strings and tests.
  • Loading branch information
PGS62 committed Feb 5, 2024
1 parent 75c8701 commit 6c727c0
Show file tree
Hide file tree
Showing 8 changed files with 161 additions and 102 deletions.
1 change: 0 additions & 1 deletion Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ version = "3.0.0"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
NamedArrays = "86f7a689-2022-50b4-a561-43c23ac3c673"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

Expand Down
52 changes: 52 additions & 0 deletions scripts/performancetestresults.txt
Original file line number Diff line number Diff line change
Expand Up @@ -485,4 +485,56 @@ A ╲ B │ numcols(x) StatsBase.corkendall KendallTau.corkendall
6 │ 64.0 0.069224 0.0070644
7 │ 128.0 0.285385 0.0280557
8 │ 256.0 1.1547 0.110696
###################################################################


julia> how_scaleable(fns,1000,2 .^ (1:12),false,false,true)
###################################################################
Executing how_scaleable 2024-02-05T12:08:53.214
ComputerName = DESKTOP-HSGAM5S
fns[1] = StatsBase.corkendall
fns[2] = KendallTau.corkendall
ncs = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
with_missings = false
use_benchmark_tools = false
Threads.nthreads() = 20
nc = 2, nr = 1000, f = StatsBase.corkendall, time = 0.0011769
nc = 2, nr = 1000, f = KendallTau.corkendall, time = 0.0103372, ratio = 0.11385094609758929
nc = 4, nr = 1000, f = StatsBase.corkendall, time = 0.0004849
nc = 4, nr = 1000, f = KendallTau.corkendall, time = 0.0009406, ratio = 0.5155220072294281
nc = 8, nr = 1000, f = StatsBase.corkendall, time = 0.0019536
nc = 8, nr = 1000, f = KendallTau.corkendall, time = 0.0027633, ratio = 0.7069807838454024
nc = 16, nr = 1000, f = StatsBase.corkendall, time = 0.0084139
nc = 16, nr = 1000, f = KendallTau.corkendall, time = 0.0024703, ratio = 3.406023559891511
nc = 32, nr = 1000, f = StatsBase.corkendall, time = 0.0386484
nc = 32, nr = 1000, f = KendallTau.corkendall, time = 0.0053999, ratio = 7.1572436526602345
nc = 64, nr = 1000, f = StatsBase.corkendall, time = 0.4622757
nc = 64, nr = 1000, f = KendallTau.corkendall, time = 0.0098453, ratio = 46.95394756889074
nc = 128, nr = 1000, f = StatsBase.corkendall, time = 0.2959822
nc = 128, nr = 1000, f = KendallTau.corkendall, time = 0.0306628, ratio = 9.652810571767743
nc = 256, nr = 1000, f = StatsBase.corkendall, time = 1.1365017
nc = 256, nr = 1000, f = KendallTau.corkendall, time = 0.1254557, ratio = 9.058988152790187
nc = 512, nr = 1000, f = StatsBase.corkendall, time = 4.5060678
nc = 512, nr = 1000, f = KendallTau.corkendall, time = 0.4610122, ratio = 9.774291873403785
nc = 1024, nr = 1000, f = StatsBase.corkendall, time = 18.2435851
nc = 1024, nr = 1000, f = KendallTau.corkendall, time = 1.8419572, ratio = 9.904456574778177
nc = 2048, nr = 1000, f = StatsBase.corkendall, time = 73.7546561
nc = 2048, nr = 1000, f = KendallTau.corkendall, time = 7.707059, ratio = 9.569753663492131
nc = 4096, nr = 1000, f = StatsBase.corkendall, time = 298.6694524
nc = 4096, nr = 1000, f = KendallTau.corkendall, time = 30.1120506, ratio = 9.918602235611282
12×3 Named Matrix{Float64}
A ╲ B │ numcols(x) StatsBase.corkendall KendallTau.corkendall
──────┼────────────────────────────────────────────────────────────────────
1 │ 2.0 0.0011769 0.0103372
2 │ 4.0 0.0004849 0.0009406
3 │ 8.0 0.0019536 0.0027633
4 │ 16.0 0.0084139 0.0024703
5 │ 32.0 0.0386484 0.0053999
6 │ 64.0 0.462276 0.0098453
7 │ 128.0 0.295982 0.0306628
8 │ 256.0 1.1365 0.125456
9 │ 512.0 4.50607 0.461012
10 │ 1024.0 18.2436 1.84196
11 │ 2048.0 73.7547 7.70706
12 │ 4096.0 298.669 30.1121
###################################################################
2 changes: 1 addition & 1 deletion scripts/performancetests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -404,7 +404,7 @@ function how_scaleable(fns, nr::Integer, ncs::Vector{<:Integer},
marker=:circle,
xaxis=:log,
yaxis=:log,
size=(1500, 1000),
size=(900, 600),
grid=true)

end
16 changes: 0 additions & 16 deletions scripts/soaktest.jl

This file was deleted.

105 changes: 39 additions & 66 deletions src/corkendall.jl
Original file line number Diff line number Diff line change
Expand Up @@ -8,71 +8,35 @@
const RoMVector{T<:Real} = AbstractVector{<:Union{T,Missing}}
const RoMMatrix{T<:Real} = AbstractMatrix{<:Union{T,Missing}}

#= TODO 27 July 2023
Review use of threads in light of
https://julialang.org/blog/2023/07/PSA-dont-use-threadid/#another_option_use_a_package_which_handles_this_correctly
=#

#= TODO 22 Feb 2023
How to get compatibility of corkendall with StatsBase.pairwise? The problem is that
pairwise passes vectors to f that don't contain missing but for which missing isa eltype
and corkendall then wants a skipmissing argument.
Option a)
Amend StatsBase._pairwise! to replace line:
dest[i, j] = f(ynm, ynm)
with:
dest[i, j] = f(disallowmissing(ynm), disallowmissing(ynm))
Option b)
Some other way to arrange that arguments passed to f from within StatsBase._pairwise!
do not have Missing as an allowed element type. Using function handle_pairwise! would do
that efficiently.
Option c)
In corkendall vector-vector method, if missing is an element type of both x and of y, but
missing does not appear in either x or y, then call disallowmissing twice, like this:
if missing isa eltype(x) && missing isa eltype(y)
if !any(ismissing,x) && !any(ismissing,y)
x = disallowmissing(x)
y = disallowmissing(y)
end
end
Option d)
Have a dedicated method of _pairwise! to handle f === corkendall. This has a big
performance advantage, and is maybe along the lines suggested by nalimilan here:
https://github.com/JuliaStats/StatsBase.jl/pull/647#issuecomment-775264454
=#

"""
corkendall(x, y=x; skipmissing::Symbol=:none)
Compute Kendall's rank correlation coefficient, τ. `x` and `y` must be either vectors or
matrices, with elements that are either real numbers or `missing`. The function uses
multiple threads if they are available.
matrices, with elements that are either real numbers or `missing`. Uses multiple threads if
available.
# Keyword argument
- `skipmissing::Symbol=:none`: If `:none` (the default), then `missing` entries in `x` or
`y` lead to `missing` entries in the return. If `:pairwise`, when either of the `i`th
entries of the vectors used to calculate an element of the return is `missing`, both
entries are skipped. If `:listwise`, when any entry in the `i`th row of `x` or the `i`th
row of `y` is `missing` then all entries in the `i`th rows are skipped; note that this
might skip a high proportion of entries. Only allowed when `x` or `y` is a matrix.
`y` give rise to `missing` entries in the return. If `:pairwise`, when either of the
`i`th entries of the vectors required to calculate an element of the return is `missing`,
both entries are skipped. If `:listwise`, when any entry in the `i`th row of `x` or the
`i`th row of `y` is `missing` then the entire `i`th rows are skipped; note that
this might skip a high proportion of entries. Only allowed when `x` or `y` is a matrix.
"""
function corkendall(x::RoMMatrix{T}, y::RoMMatrix{U}=x; skipmissing::Symbol=:none) where {T,U}
function corkendall(x::RoMMatrix{T}, y::RoMMatrix{U}=x;
skipmissing::Symbol=:none) where {T,U}

Base.require_one_based_indexing(x, y)

size(x, 1) == size(y, 1) || throw(DimensionMismatch("x and y have inconsistent dimensions"))
size(x, 1) == size(y, 1) || throw(DimensionMismatch("x and y have
inconsistent dimensions"))

symmetric = x === y

missing_allowed = missing isa eltype(x) || missing isa eltype(y)
skipmissing in [:none, :pairwise, :listwise] || throw(ArgumentError("skipmissing must be one of :none, :pairwise or :listwise"))
skipmissing in [:none, :pairwise, :listwise] || throw(ArgumentError("skipmissing must \
be one of :none, :pairwise or :listwise, but got :$skipmissing"))

# Degenerate case - U and/or T not defined.
if x isa Matrix{Missing} || y isa Matrix{Missing}
Expand All @@ -92,7 +56,7 @@ function corkendall(x::RoMMatrix{T}, y::RoMMatrix{U}=x; skipmissing::Symbol=:non

if missing_allowed
if skipmissing == :listwise
x, y = handle_listwise!(x, y)
x, y = handle_listwise(x, y)
end
end

Expand Down Expand Up @@ -154,7 +118,8 @@ function corkendall(x::RoMMatrix{T}, y::RoMMatrix{U}=x; skipmissing::Symbol=:non

for i = 1:(symmetric ? j - 1 : nc)
ycoli .= view(y, :, i)
C[j, i] = corkendall_kernel!(sortedxcolj, ycoli, permx, scratch_py, scratch_sy, scratch_fx, scratch_fy, skipmissing)
C[j, i] = corkendall_kernel!(sortedxcolj, ycoli, permx, scratch_py,
scratch_sy, scratch_fx, scratch_fy, skipmissing)
symmetric && (C[i, j] = C[j, i])
end
end
Expand All @@ -165,10 +130,12 @@ function corkendall(x::RoMVector{T}, y::RoMVector{U}; skipmissing::Symbol=:none)

Base.require_one_based_indexing(x, y)

length(x) == length(y) || throw(DimensionMismatch("x and y have inconsistent dimensions"))
length(x) == length(y) || throw(DimensionMismatch("x and y have \
inconsistent dimensions"))

missing_allowed = missing isa eltype(x) || missing isa eltype(y)
skipmissing in [:none, :pairwise] || throw(ArgumentError("skipmissing must be one of :none or :pairwise"))
skipmissing in [:none, :pairwise] || throw(ArgumentError("skipmissing must be one of \
:none or :pairwise, but got :$skipmissing"))

if missing_allowed && skipmissing == :none
if any(ismissing, x) || any(ismissing, y)
Expand All @@ -182,13 +149,14 @@ function corkendall(x::RoMVector{T}, y::RoMVector{U}; skipmissing::Symbol=:none)
x = copy(x)

if missing_allowed && skipmissing == :pairwise
x, y = handle_pairwise!(x, y, similar(x, T), similar(y, U))
x, y = handle_pairwise(x, y)
end

permx = sortperm(x)
permute!(x, permx)

return corkendall_kernel!(x, y, permx, similar(y), similar(y), similar(x, T), similar(y, U), skipmissing)
return corkendall_kernel!(x, y, permx, similar(y), similar(y),
similar(x, T), similar(y, U), skipmissing)
end

#= corkendall returns a vector in this case, inconsistent with with Statistics.cor and
Expand Down Expand Up @@ -243,7 +211,7 @@ function corkendall_kernel!(sortedx::RoMVector{T}, y::RoMVector{U},
end

if missing isa eltype(sortedx) || missing isa eltype(scratch_py)
sortedx, permutedy = handle_pairwise!(sortedx, scratch_py, scratch_fx, scratch_fy)
sortedx, permutedy = handle_pairwise(sortedx, scratch_py; scratch_fx, scratch_fy)
else
permutedy = scratch_py
end
Expand All @@ -262,9 +230,11 @@ function corkendall_kernel!(sortedx::RoMVector{T}, y::RoMVector{U},
if sortedx[i-1] == sortedx[i]
k += 1
elseif k > 0
# Sort the corresponding chunk of permutedy, so the rows of hcat(sortedx,permutedy)
# are sorted first on sortedx, then (where sortedx values are tied) on permutedy.
# Hence double ties can be counted by calling countties.
#=
Sort the corresponding chunk of permutedy, so rows of hcat(sortedx,permutedy)
are sorted first on sortedx, then (where sortedx values are tied) on permutedy.
Hence double ties can be counted by calling countties.
=#
sort!(view(permutedy, (i-k-1):(i-1)))
ntiesx += div(widen(k) * (k + 1), 2) # Must use wide integers here
ndoubleties += countties(permutedy, i - k - 1, i - 1)
Expand Down Expand Up @@ -402,14 +372,16 @@ function insertion_sort!(v::AbstractVector, lo::Integer, hi::Integer)
end

"""
handle_pairwise!(x::RoMVector{T}, y::RoMVector{U},
handle_pairwise(x::RoMVector{T}, y::RoMVector{U},
scratch_fx::AbstractVector{T}, scratch_fy::AbstractVector{U}) where {T,U}
Return a pair `(a,b)`, filtered copies of `(x,y)`, in which elements `x[i]` and
`y[i]` are filtered out if `ismissing(x[i])||ismissing(y[i])`.
`y[i]` are excluded if `ismissing(x[i])||ismissing(y[i])`. The element types of `a` and
`b` are `T` and `U` so that `Missing` is not an element type of either.
"""
function handle_pairwise!(x::RoMVector{T}, y::RoMVector{U},
scratch_fx::AbstractVector{T}, scratch_fy::AbstractVector{U}) where {T,U}
function handle_pairwise(x::RoMVector{T}, y::RoMVector{U};
scratch_fx::AbstractVector{T}=similar(x, T),
scratch_fy::AbstractVector{U}=similar(y, U)) where {T,U}

j = 0

Expand All @@ -425,12 +397,13 @@ function handle_pairwise!(x::RoMVector{T}, y::RoMVector{U},
end

"""
handle_listwise!(x::RoMMatrix{T}, y::RoMMatrix{U}) where {T,U}
handle_listwise(x::RoMMatrix{T}, y::RoMMatrix{U}) where {T,U}
Return a pair `(a,b)`, filtered copies of `(x,y)`, in which the rows `x[i,:]` and
`y[i,:]` are both filtered out if `any(ismissing,x[i,:])||any(ismissing,y[i,:])`.
`y[i,:]` are both excluded if `any(ismissing,x[i,:])||any(ismissing,y[i,:])`. The element
types of `a` and `b` are `T` and `U` so that `Missing` is not an element type of either.
"""
function handle_listwise!(x::RoMMatrix{T}, y::RoMMatrix{U}) where {T,U}
function handle_listwise(x::RoMMatrix{T}, y::RoMMatrix{U}) where {T,U}

nrx = size(x, 1)
nry = size(y, 1)
Expand Down
1 change: 0 additions & 1 deletion src/corkendall_fromfile.jl
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,6 @@ end
function csvread2(filename::String, ignorefirstrow::Bool, ignorefirstcol::Bool;
missingstring::Union{Nothing,String,Vector{String}}="")


contents = DelimitedFiles.readdlm(filename, ',')
contents = ifelse.(contents .== missingstring, missing, contents)
firstrow = ignorefirstrow ? 2 : 1
Expand Down
38 changes: 38 additions & 0 deletions src/notes.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
#= TODO 27 July 2023
Review use of threads in light of
https://julialang.org/blog/2023/07/PSA-dont-use-threadid/#another_option_use_a_package_which_handles_this_correctly
=#

#= TODO 22 Feb 2023
How to get compatibility of corkendall with StatsBase.pairwise? The problem is that
pairwise passes vectors to f that don't contain missing but for which missing isa eltype
and corkendall then wants a skipmissing argument.
Option a)
Amend StatsBase._pairwise! to replace line:
dest[i, j] = f(ynm, ynm)
with:
dest[i, j] = f(disallowmissing(ynm), disallowmissing(ynm))
Option b)
Some other way to arrange that arguments passed to f from within StatsBase._pairwise!
do not have Missing as an allowed element type. Using function handle_pairwise would do
that efficiently.
Option c)
In corkendall vector-vector method, if missing is an element type of both x and of y, but
missing does not appear in either x or y, then call disallowmissing twice, like this:
if missing isa eltype(x) && missing isa eltype(y)
if !any(ismissing,x) && !any(ismissing,y)
x = disallowmissing(x)
y = disallowmissing(y)
end
end
Option d)
Have a dedicated method of _pairwise! to handle f === corkendall. This has a big
performance advantage, and is maybe along the lines suggested by nalimilan here:
https://github.com/JuliaStats/StatsBase.jl/pull/647#issuecomment-775264454
=#
Loading

0 comments on commit 6c727c0

Please sign in to comment.