Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GLM fails to match certified values for the Filip dataset #558

Open
mousum-github opened this issue May 5, 2024 · 5 comments
Open

GLM fails to match certified values for the Filip dataset #558

mousum-github opened this issue May 5, 2024 · 5 comments

Comments

@mousum-github
Copy link
Collaborator

mousum-github commented May 5, 2024

filip.csv

In response to industrial concerns about the numerical accuracy of computations from statistical software, the Statistical Engineering and Mathematical and Computational Sciences Divisions of NIST's Information Technology Laboratory are providing datasets with certified values for various statistical methods.
One of them for linear regression is the Filip dataset.
https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml

For this dataset, the lm function with the QR decomposition method fails to estimate all parameters

LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                    Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   8.13378        9.54815        0.85    0.3971  -10.9001        27.1677
x             0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        -7.14418       14.9533        -0.48    0.6343  -36.953         22.6647
x ^ 3        -4.53337       14.5308        -0.31    0.7560  -33.5           24.4333
x ^ 4        -0.881151       6.84421       -0.13    0.8979  -14.5248        12.7625
x ^ 5         0.135741       1.93585        0.07    0.9443   -3.7233         3.99479
x ^ 6         0.0989815      0.351359       0.28    0.7790   -0.601441       0.799403
x ^ 7         0.0207993      0.0413984      0.50    0.6169   -0.0617269      0.103326
x ^ 8         0.0022362      0.00307053     0.73    0.4688   -0.0038848      0.0083572
x ^ 9         0.000124681    0.000130509    0.96    0.3426   -0.000135484    0.000384846
x ^ 10.0      2.86391e-6     2.42701e-6     1.18    0.2419   -1.97424e-6     7.70207e-6

whereas the lm function with the Cholesky decomposition method fails to estimate 6 parameters

LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.     Std. Error       t  Pr(>|t|)      Lower 95%      Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0          NaN            NaN       NaN     NaN            NaN
x            0.0          NaN            NaN       NaN     NaN            NaN
x ^ 2        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 3        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 4        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 5        0.0          NaN            NaN       NaN     NaN            NaN
x ^ 6        0.00368644     0.000219028   16.83    <1e-26    0.0032503      0.00412258
x ^ 7        0.00191731     0.000127438   15.05    <1e-23    0.00166355     0.00217107
x ^ 8        0.000375885    2.75293e-5    13.65    <1e-21    0.000321067    0.000430703
x ^ 9        3.28191e-5     2.61667e-6    12.54    <1e-19    2.76087e-5     3.80296e-5
x ^ 10       1.07467e-6     9.23624e-8    11.64    <1e-17    8.90753e-7     1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────

The following table shows that the estimates from the lm function in R are close to the certified values.

  Certified Values R Values Julia (QR) Julia (Cholesky)
Beta_0 -1467.4896142298000000 -1467.4895242210000000 8.1337808176521600 0.0000000000000000
Beta_1 -2772.1795919334200000 -2772.1794222850000000 0.0000000000000000 0.0000000000000000
Beta_2 -2316.3710816089300000 -2316.3709402080000000 -7.1441779010897300 0.0000000000000000
Beta_3 -1127.9739409837200000 -1127.9738723330000000 -4.5333687533072200 0.0000000000000000
Beta_4 -354.4782337033490000 -354.4782121956000000 -0.8811512968692300 0.0000000000000000
Beta_5 -75.1242017393757000 -75.1241971938900000 0.1357405515617050 0.0000000000000000
Beta_6 -10.8753180355343000 -10.8753173788800000 0.0989814591132414 0.0036864417075102
Beta_7 -1.0622149858894700 -1.0622149218240000 0.0207993361386816 0.0019173118731864
Beta_8 -0.0670191154593408 -0.0670191114167300 0.0022361990171227 0.0003758850326059
Beta_9 -0.0024678107827548 -0.0024678106336760 0.0001246810033160 0.0000328191326413
Beta_10 -0.0000402962525080 -0.0000402962500667 0.0000028639148280 0.0000010746699638
@mousum-github
Copy link
Collaborator Author

Moreover,

df_filip = CSV.read("filip.csv", DataFrame);
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr,);
X =  modelmatrix(model_fil_qr);
y = response(model_fil_qr);

rank(X)
# 10
rank(X'X)
# 5

It might be because of the limitation of LINPACK.
Now let's try the same with BigFloat:

using GenericLinearAlgebra
X =  big.(modelmatrix(model_filip_qr))
rank(X)
# 11
rank(X'X)
# 11

and finally,

inv(X'X)*X'y

11-element Vector{BigFloat}:
 -1467.4896101225341941
 -2772.1795838225322676
 -2316.3710745083318604
 -1127.9739373552426084
  -354.47823250481932410
   -75.1242014719792427
   -10.8753179947227183
    -1.0622149816811447
    -0.0670191151786945
    -0.0024678107718217
    -4.0296252319e-05

@mousum-github mousum-github changed the title GLM fails to match certified values for Filip dataset GLM fails to match certified values for the Filip dataset May 5, 2024
@andreasnoack
Copy link
Member

andreasnoack commented May 6, 2024

I don't think it's surprising that a coefficient is dropped since the default is dropcollinear = true and

julia> cond(X)
1.7679681910558162e15

It's surprising, though, that we error out when setting dropcollinear = false

julia> lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr, dropcollinear=false)
ERROR: RankDeficientException(10)
Stacktrace:
 [1] delbeta!(p::GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}, r::Vector{Float64})
   @ GLM ~/.julia/dev/GLM/src/linpred.jl:66
 [2] fit!(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:97
 [3] fit(::Type{…}, f::FormulaTerm{…}, data::DataFrame, allowrankdeficient_dep::Nothing; wts::Nothing, dropcollinear::Bool, method::Symbol, contrasts::Dict{…})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:0
 [4] fit
   @ ~/.julia/dev/GLM/src/lm.jl:149 [inlined]
 [5] lm
   @ ~/.julia/dev/GLM/src/lm.jl:179 [inlined]
 [6] top-level scope
   @ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.

The problem is the check in

rnk == length(p.delbeta) || throw(RankDeficientException(rnk))
which I don't think is a good idea. If the user has set dropcollinear = false then we shouldn't try to outsmart her by computing a numerical rank. We should just check that there isn't a zero element in the diagonal of R.

However, to my surprise, it looks like fixing this still isn't sufficient to match the reference estimates. It looks like LAPACK's pivoted QR loses some precision relative to the standard QR.

julia> qr(X, ColumnNorm()) \ df_filip.y
11-element Vector{Float64}:
  9.01342654713381
  1.6525459423443027
 -5.767606630529379
 -3.8636659613827122
 -0.6703658653069542
  0.1806043161686472
  0.10552342922775357
  0.0214449396217032
  0.0022774832947497475
  0.00012622643173822303
  2.8896433329375396e-6

julia> qr(X) \ df_filip.y
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5

@mousum-github
Copy link
Collaborator Author

Q,R = qr(X);
diag(R)
11-element Vector{Float64}:
   -9.055385138137417
  -13.532654650686624
  -21.825229067114073
   30.328064942426202
   44.4823844079297
   61.77383426163982
   90.2629854477562
 -127.0559597329256
  186.65576017620813
  253.04777394199888
 -373.39818100600036

rank(R)
10

I was also a bit surprised when I checked the diagonal elements of R and rank(R).
In fact, the inv function generates the inverse of R

Rinv = inv(R)
Rinv * Rinv' * X'y
11-element Vector{Float64}:
 -1467.489589088171
 -2772.1795409834594
 -2316.371035859374
 -1127.973917018297
  -354.4782255940765
   -75.12419988749059
   -10.875317746475703
    -1.0622149554340137
    -0.06701911338591168
    -0.0024678107003603396
    -4.029625105615225e-5

which is again a little bit different from what is generated by qr(X)\y.

@andreasnoack
Copy link
Member

andreasnoack commented May 6, 2024

The QR based least squares solution is

julia> F = qr(X);

julia> F.R\(F.Q'*df_filip.y)[1:size(F, 2)]
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5

@andreasnoack
Copy link
Member

andreasnoack commented May 6, 2024

Oh. Now I realize what is going on in the pivoted case. There is a rank determination by default, so the relevant comparison is

julia> ldiv!(qr(X, ColumnNorm()), df_filip.y[:,:], 0.0)[1][1:size(X, 2)]
11-element Vector{Float64}:
 -1467.4896118403967
 -2772.1795879957995
 -2316.371078886713
 -1127.9739399822226
  -354.4782335067247
   -75.12420172645771
   -10.875318038405625
    -1.062214986693476
    -0.06701911554715413
    -0.002467810787511543
    -4.029625261329208e-5

where the third argument to ldiv! is the tolerance used when determining the numerical rank. To conclude, I think we just need to get rid of/adjust the check in

rnk == length(p.delbeta) || throw(RankDeficientException(rnk))
then we'd be able to match the reference estimates when setting dropcollinear = false.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants