-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] Use ALTREP system to return C++-allocated arrays #6213
Conversation
I unfortunately am unable to find why the CI jobs are failing. Seems to run fine in my machine with |
The The For all the other R-package job failures... they are not showing up on Those can be reproduced by using the LightGBM/.ci/test_r_package.sh Lines 149 to 150 in 2ee3ec8
I haven't reviewed this PR yet so I can't provide more specific guidance than that yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most useful thing to ALTREP here would be the externalptr handles through R_make_altlist_class introduced in R 4.3,
Before we go too far with this, I want to be clear... I don't support dropping support for R < 4.3
for the benefit "avoiding redundant memory copies" at this time, unless you can make the case that those redundant copies are significantly problematic for a majority of LightGBM's users.
The first R 4.3.x, v4.3.0, has only been out for about 7 months (release history).
r-oldrel
on CRAN is still on R 4.2.3, for example:
rversions::r_oldrel()
# version date nickname
# 130 4.2.3 2023-03-15 08:06:01 Shortstop Beagle
Also, just for my understanding, are the changes in this PR the same type of uses of ALTREP that you referred to as "rather hacky" over in dmlc/xgboost#9734 (comment)?
I'm not familiar with ALTREP and will need to do some reading to effectively review this. Any evidence or documentation you could provide to help us understand what this mechanism is, how it works, and what the benefit of these changes are to LightGBM would be helpful.
From some investigation, this is because the cmake build uses name It can be fixed for the cmake builds by changing the name of the init function from Is there perhaps some macro that is defined for the cmake builds but not for the CRAN ones or vice-versa? Otherwise, I don't know what would be the right way to solve this.
No, this is standard usage of ALTREP as it was designed to, and as can be found on articles and examples elsewhere:
Currently, lightbm creates a serialized version of models on every call. Lots of models created in interactive sessions don't end up being saved to disk. This increases fitting times and memory usage in the R interface in a way which doesn't happen in other interfaces. Then, it also makes models read from a serialized file (like But that'd be for using ALTREP to handle serialization of |
Sorry for the long delay! I'm ready to come back and help with reviews to move this forward.
There is not, currently. Elsewhere in the R package, we handle this particular thing by having That's done here: LightGBM/build-cran-package.sh Lines 168 to 183 in 78d021c
Could you try:
Excellent, thanks very much for the links! That's helpful.
Thanks very much for the explanation of the benefits of more ALTREP changes. I learned a lot from your descriptions in dmlc/xgboost#9847. At this time I still don't support dropping support for R 3.6 in future PRs in exchange for those benefits. If it's possible to make thosechanges that require |
Modified to use |
Now it looks like only the msvc jobs are failing while other cmake jobs are passing. I cannot find any error message in those msvc jobs so don't know what went wrong. |
R-package/src/lightgbm_R.cpp
Outdated
|
||
void R_init_lightgbm(DllInfo *dll) { | ||
void R_init_lib_lightgbm(DllInfo *dll) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this and the similar change above it are necessary. I suspect those might be the cause of the MSVC failures you're seeing.
Can you please revert them? This function should be called R_init_lightgbm
, for both the CRAN (autotools
) package and the CMake-based builds.
As described in https://cran.r-project.org/doc/manuals/R-exts.html
It is good practice for DLLs to register their symbols (see Registering native routines), restrict visibility (see Controlling visibility) and not allow symbol search (see Registering native routines). It should be possible for a DLL to have only one visible symbol, R_init_pkgname, on suitable platforms84, which would completely avoid symbol conflicts.
See also:
R_init_lightgbm |
and #4494
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would fail in my setup if I rename the init function to not have lib_
. Changing it in the altrep registration name doesn't seem to generate any failures though, so I've changed it there to see if it makes a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be very surprised to learn that this function name needs to change as a part of this PR. It's already a part of the C API for lib_lightgbm.dll
(from the perspective of the R package built with MSVC), and the changes in this PR are all otherwise not exposed as C API changes.
At a minimum, I suspect the CI failures you're seeing for MSVC are because you're changing this function name but not changing the lightgbm-win.def
file I linked above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other cmake-based jobs that were tested here, leaving the name as R_init_lightgbm
makes the job fail, while the same jobs succeed when changing it to R_init_lib_lightgbm
.
The -win.def
file that you mentioned AFAIK is not used in the msvc build from cmake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmmm ok. Thanks for looking into that!
I still don't understand though what about these changes would require that function name to be changed to R_init_lib_lightgbm
when it is working on master
as R_init_lightgbm
. Given that you're just adding new behavior inside that function.
Anyway, maybe this and the .def
file are not the right place to look anyway.... looking more closely at the failing builds, I see that quite a few unit tests are passing before those jobs eventually fail.
e.g. on https://github.com/microsoft/LightGBM/actions/runs/7413984613/job/20174104011?pr=6213
----- end of build and install logs -----
Running tests with testthat.R
basic: ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
custom_objective: ..........
dataset: ................................S.............................................
learning_to_rank: ....................................................
lgb.Booster: ..........................................................................
Error: Process completed with exit code -1.
I have a Windows laptop with MSVC and R on it.... I can pull this branch and test. Maybe I'll at least be able to get some logs pointing to what is happening. Powershell provided on GitHub Actions runners (which we use to orchestrate those tests, see https://github.com/microsoft/LightGBM/blob/master/.ci/test_r_package_windows.ps1) is very picky about how it manages stderr and we've put in many changes to work around that in the future. For example:
LightGBM/.ci/test_r_package_windows.ps1
Lines 16 to 29 in 48e3629
# External utilities like R.exe / Rscript.exe writing to stderr (even for harmless | |
# status information) can cause failures in GitHub Actions PowerShell jobs. | |
# See https://github.community/t/powershell-steps-fail-nondeterministically/115496 | |
# | |
# Using standard PowerShell redirection does not work to avoid these errors. | |
# This function uses R's built-in redirection mechanism, sink(). Any place where | |
# this function is used is a command that writes harmless messages to stderr | |
function Run-R-Code-Redirect-Stderr { | |
param( | |
[string]$rcode | |
) | |
$decorated_code = "out_file <- file(tempfile(), open = 'wt'); sink(out_file, type = 'message'); $rcode; sink()" | |
Rscript --vanilla -e $decorated_code | |
} |
Maybe some useful error or warning is being written to stderr and we're not seeing it in CI logs as a result of those changes.
Thanks for all your efforts so far! I can take over the investigation for a bit. I think it's likely the issue is about the peculiarities of LightGBM's CI setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the non-msvc cmake ones, before the name change to R_init_lib_lightgbm
, the tests would fail when they try to do something with the variables representing altrep classes introduced here (R_altrep_class_t
), which would not have been initialized by the time the .Call
uses them and thus have invalid pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe a start could be to verify whether lines like this one are run before .Call
:
lgb_altrepped_char_vec = R_make_altraw_class("lgb_altrepped_char_vec", "lightgbm", dll);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iiiiinteresting, thanks for that!
Since you mention this being maybe related to the timing of when those statements run, I think it's important to point out a difference between the MSYS/MinGW CMake CI jobs and the MSVC ones.
The MSYS/MinGW ones run R CMD check
.
LightGBM/.ci/test_r_package_windows.ps1
Line 133 in 48e3629
if ($env:COMPILER -ne "MSVC") { |
LightGBM/.ci/test_r_package_windows.ps1
Line 179 in 48e3629
Run-R-Code-Redirect-Stderr "result <- processx::run(command = 'R.exe', args = $check_args, echo = TRUE, windows_verbatim_args = FALSE, error_on_status = TRUE)" ; $check_succeeded = $? |
The MSVC ones just run Rscript --vanilla R-package/tests/testthat.R
LightGBM/.ci/test_r_package_windows.ps1
Lines 281 to 288 in 48e3629
if ($env:COMPILER -eq "MSVC") { | |
Write-Output "Running tests with testthat.R" | |
cd R-package/tests | |
# NOTE: using Rscript.exe intentionally here, instead of Run-R-Code-Redirect-Stderr, | |
# because something about the interaction between Run-R-Code-Redirect-Stderr | |
# and testthat results in failing tests not exiting with a non-0 exit code. | |
Rscript.exe --vanilla "testthat.R" ; Check-Output $? | |
} |
That could result in different code paths being followed and a different order of things being initialized? I'm not sure. If that's the case, maybe we'll have to add a .onLoad()
or .onAttach()
to the package (though I really hope not).
I'll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to work on removing this difference in the library name (lib_lightgbm
vs. lightgbm)
in the different types of R builds here. Work in progress at #6432. Hopefully it'll help with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#6432 has been merged, so:
- all this
lib_lightgbm
vs.lightgbm
stuff in the R package is gone - the CMake-based builds now should do routine registration and invoke
R_init_lightgbm()
just like the CRAN-style build do
Let's see in CI if it helps.
Sorry for the VERY long delay in reviews. As I'm sure you've noticed, we are struggling with a lack of maintainer availability here. I have a Windows laptop set up for development now, will look into this today. |
I am unable to build the library with MSVC after the latest updates and cannot find full error traces in the failing jobs, so not sure what is failing now. |
Ok @david-cortes I have some more debugging information to share, hopefully it'll help. On my Windows laptop (Windows 10 Home, R 4.4.0, Rtools44), on latest The change in #6451 appears to fix them. I think the root cause was CMake detecting an old target Windows version and therefore Windows SDK. With that change applied on this branch, I was also able to reproduce the test failure we're seeing for MSVC jobs in CI, and narrow it down to a specific test case.
To reproduce, you could apply the changes from #6451 to this branch, then install the package with Rscript build_r.R --no-build-vignettes Then try the following in an R session. library(lightgbm)
set.seed(708L)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
bst <- lightgbm::lightgbm(
data = as.matrix(train$data)
, label = train$label
, params = list(
num_leaves = 4L
, objective = "binary"
, verbose = 1
, num_threads = 2
)
, nrounds = 2L
)
# NOTE: only values of 0 or 1 are supported in the underlying API called here
bst$save_model_to_string(
feature_importance_type = 2L
) On
And the session does not crash. On this branch, installing the same way, that code emits that error text to stderr and then the R session crashes. In case it helps, here's the place where that error is raised from: LightGBM/src/boosting/gbdt_model_text.cpp Line 658 in 20f2092
|
I can confirm that the code under No idea why it'd be crashing though, and am unfortunately not familiar with debugging tools for MSVC. Could this perhaps have something to do with the compile define STRUCT_SUBTYPES for altrep classes? It doesn't seem to be user-configurable, but it looks like it has an alternative definition of altrep class structure for what I assume could be problematic compilers. |
After some testing, it's actually not anyhow related to Only solution I can think of would be to disable this feature if it detects that it's being compiled with MSVC. |
After some testing, it is actually a matter of having the alternative definition of data structures when After manually disabling it in the R Unfortunately, the macro is forcibly defined in the same R header: So I cannot think of any mechanism that could be used to prevent its definition in the lines below that one. |
Thank you so much for investigating that! I read back through all the links you provided (#6213 (comment)) and did my own searches for MSVC-specific discussions about ALTREP... didn't find anything suggesting another path forward. I'd support modifying this PR such that the feature is disabled when compiling with MSVC. I know that adds some complexity to the code, but maybe it'd be worth it for a while to keep that separation... in case there's some specific planned breaking changes to the ALTREP API that led Dr. Tierney to push this 3 weeks ago: wch/r-source@fb52ac1 /*
Not part of the API, subject to change at any time.
This API is experimental and may change on short notice.
Package authors using this API should be prepared to adapt to changes
when they occur.
*/ |
I couldn't find any better solution, so I've now disabled usage of altrep classes when compiling with msvc. |
Did some additional investigation about the failing MSVC tests. It looks like there's some additional issue apart from the altrep struct declaration. The failing test from the previous snippet doesn't even reach the Altrep section - it crashes somewhere inside the If the I am guessing there must either be a compiler bug somewhere, or there must be some additional memory corruption happening before the call to So I'm reverting back the whole function for MSVC this time. |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: failure ❌. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working through this! And for starting a thread on r-pkg-devel (link).
This looks great to me, and now that I'm caught up on how ALTREP works (thanks for the links), I'm excited for the potential.
So I'm reverting back the whole function for MSVC this time.
I support isolating this function for MSVC this way.
I've triggered the valgrind checks on this branch. Assuming that those pass, I support merging this and getting it into the v4.4.0 release (#6439).
I'd appreciate if we could get one more C++ review @shiyu1994 @guolinke @btrotta
R-package/src/lightgbm_R.cpp
Outdated
#ifndef _MSC_VER | ||
SEXP LGBM_BoosterSaveModelToString_R(SEXP handle, | ||
SEXP num_iteration, | ||
SEXP feature_importance_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just merged #6398. In that, @mayer79 exposed start_iteration
on LGBM_BoosterSaveModelToString_R()
. Could you add that on this implementation?
Sorry for that, it's unfortunate timing that these two PRs both modifying LGBM_BoosterSaveModelToString_R()
reached the ready-to-be-merged state at roughly the same time.
From the valgrind logs, I see two of them mention:
Which is most likely a false positive since valgrind doesn't work correctly with The other doesn't mention
So I'd say there's a very high chance that it's also coming from |
I totally agree with you. And even if it isn't a false positive.... it's not a result of this PR's changes. I just ran these tests on a different branch (#6463 (comment)) and see the exact same findings from I don't have permissions to admin-merge over a failing CI check here. I'll get those tests passing on #6463 then do a rebuild here and merge this. |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
Thanks very much for this!! |
This PR modifies the R interface to return some C++-allocated objects as ALTREP'd R vectors, thereby avoiding redundant memory copies.
I think the most useful thing to ALTREP here would be the
externalptr
handles throughR_make_altlist_class
introduced in R 4.3, so that serialization (calls toLGBM_BoosterSaveModelToString
andLGBM_BoosterLoadModelFromString
) would only be triggered on calls to functions likesaveRDS
/readRDS
(like the python interface does), but seeing as the handle serialization requires metadata likeimportance_type
, the interface would first need to be modified to make these parameters fixed instead of user-modifyiable.