Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove non-API calls in C #6180

Open
5 of 11 tasks
MichaelChirico opened this issue Jun 13, 2024 · 27 comments
Open
5 of 11 tasks

Remove non-API calls in C #6180

MichaelChirico opened this issue Jun 13, 2024 · 27 comments
Milestone

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Jun 13, 2024

We are getting these NOTEs on CRAN, e.g.

Found non-API calls to R: ‘SETLENGTH’, ‘SET_S4_OBJECT’,
‘SET_TRUELENGTH’, ‘UNSET_S4_OBJECT’

I'm really skeptical about the SETLENGTH and SET_TRUELENGTH parts, which have long been a key underlying feature of data.table internals. I propose we try and keep using those unless CRAN really forces our hands. It be really great if someone wanted to do a performance comparison of the package with/without those calls.

Anyway, it should be easier to remove the non-API S4 calls.

The list does keep changing as this is an actively evolving piece of r-devel. Here's the current list of things we should (eventually) address:

  • LEVELS
  • NAMED
  • SETLENGTH
  • SET_GROWABLE_BIT
  • SET_S4_OBJECT
  • SET_TRUELENGTH
  • SET_TYPEOF
  • STRING_PTR
  • TRUELENGTH
  • UNSET_S4_OBJECT
  • isFrame
@MichaelChirico MichaelChirico added this to the 1.16.0 milestone Jun 13, 2024
@tdhock
Copy link
Member

tdhock commented Jun 13, 2024

there is a web page which lists all of the API calls, grouped into 3 categories
https://yutannihilation.github.io/R-fun-API/
see also the R-devel thread which is linked from that page

@ecoRoland2
Copy link

You should probably be proactive and contact R-core regarding your need of an API to SETLENGTH and SET_TRUELENGTH. I would hate seeing slower performance in data.table. It could cause some difficult decisions at my day job.

@MichaelChirico
Copy link
Member Author

if this is important for you (or anyone else reading) please take the time to construct a performance comparison.

@ecoRoland2
Copy link

I would but unfortunately I think you need at least basic knowledge of C to do that.

@ben-schwen
Copy link
Member

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment. But there is no safe way to handle SET_TRUELENGTH right?

@tdhock
Copy link
Member

tdhock commented Jun 14, 2024

Where is SET_TRUELENGTH used? https://github.com/search?q=repo%3ARdatatable%2Fdata.table%20SET_TRUELENGTH&type=code
one place is in rbindlist which is a pretty important function for performance.
@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

@ben-schwen
Copy link
Member

ben-schwen commented Jun 14, 2024

Where is SET_TRUELENGTH used?

We use it as part of memrecycle and subsetting so its basically in most operations :D

But we are not alone with this problem r-lib/vctrs#1933

@Anirban166
Copy link
Member

@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

Sure, but I'm wondering what the tests will be focused on (since this affects multiple functions) and then what about SETLENGTH?

@MichaelChirico
Copy link
Member Author

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment.

I don't see lengthgets mentioned in WRE.

By the same definition, touching S4 objects from C is also non-API! 🤔

@MichaelChirico
Copy link
Member Author

I see, lengthgets is just what length<- calls under the hood. But that induces a copy:

a = 1:10
address(a)
# [1] "0x5651b8564b30"
length(a) <- 20
address(a)
# [1] "0x5651b8ab5c68"

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jun 20, 2024

It looks like more of our usage has been marked as non-API in the meantime:

Result: NOTE 
  File ‘data.table/libs/data_table.so’:
    Found non-API calls to R: ‘LEVELS’, ‘NAMED’, ‘SETLENGTH’,
      ‘SET_GROWABLE_BIT’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’,
      ‘UNSET_S4_OBJECT’

new:

LEVELS():

#define IS_UTF8(x) (LEVELS(x) & 8)
#define IS_ASCII(x) (LEVELS(x) & 64)
#define IS_LATIN(x) (LEVELS(x) & 4)

#define ENC_KNOWN(x) (LEVELS(x) & 12)

NAMED():

#ifndef MAYBE_SHARED
# define MAYBE_SHARED(x) (NAMED(x) > 1)
#endif
#ifndef MAYBE_REFERENCED
# define MAYBE_REFERENCED(x) ( NAMED(x) > 0 )
#endif

data.table/src/assign.c

Lines 548 to 552 in fab93a9

i+1, NAMED(thisvalue), MAYBE_SHARED(thisvalue), length(values), length(cols));
}
thisvalue = copyAsPlain(thisvalue); // PROTECT not needed as assigned as element to protected list below.
} else {
if (verbose) Rprintf(_("Direct plonk of unnamed RHS, no copy. NAMED==%d, MAYBE_SHARED==%d\n"), NAMED(thisvalue), MAYBE_SHARED(thisvalue)); // e.g. DT[,a:=as.character(a)] as tested by 754.5

SET_GROWABLE_BIT():

#if !defined(R_VERSION) || R_VERSION < R_Version(3, 4, 0)
# define SET_GROWABLE_BIT(x) // #3292
#endif

SET_GROWABLE_BIT(VECTOR_ELT(DT,i)); // #3292

@tdhock
Copy link
Member

tdhock commented Jun 21, 2024

@Jean-Romain
Copy link

Jean-Romain commented Jun 21, 2024

@tdhock there is nothing about SETLENGTH and SET_TRUELENGTH in this new section. I'm in trouble with that one too. I can remove the feature that uses it but I'm wondering how data.table team will handle as it is a key feature for data.table. See also https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html

@jangorecki
Copy link
Member

I can imagine group by, when not gforce optimized, is setting true length as well. It allocates memory for a biggest group and then reuses that space for smaller groups as well.
Similarly frollapply is doing the same for adaptive=TRUE (possibly not yet in master).

@ben-schwen
Copy link
Member

This issue also makes our CI fail since we get now a new 3rd NOTE: Compiled code should not call non-API entry points in R. Should we alter the CI to allow 3 notes on dev?

@jangorecki
Copy link
Member

Yes, for the moment till we address this issue. CI is used to spot new issues, this one is already known.

@MichaelChirico
Copy link
Member Author

After #6312 and #6313 I'll bump the milestone here to the next release, I think we've put in good effort and definitely applied the fixes they recommend in R-exts. The others remain more difficult to satisfy & can be handled later.

@jangorecki
Copy link
Member

Afair growable vectors are part of the public api, and those calls which apparently should be removed now are essential tools to work with growable vectors. That may be some argument to request official support for those.

@MichaelChirico
Copy link
Member Author

Would be worth clarifying on r-devel then. Currently, SET_GROWABLE_BIT is under "Stuff that is not API and probably should not be but is getting used."

https://github.com/r-devel/r-svn/blob/f4e98ffcff71d5d04f60ab3201d58c32b4b6e944/src/include/Rinternals.h#L1236

And there's no mention in WRE of "growable"

@jangorecki
Copy link
Member

Uh then it is possibly not a public api then. I had that impression when recalling how well it was announced when got implemented.

@SebKrantz
Copy link
Member

SebKrantz commented Oct 24, 2024

Have you guys been in touch with R-core about the growable vectors issue? Or do you know of anyone who has been in touch with them regarding this? Seems to me := would be impossible without it, you'd have to reimplement chmatch() using a proper hash table, and also rbindlist() uses it, and probably other functions as well.

@jangorecki
Copy link
Member

I don't think you mean growable vectors. AFAIU all three functions you mentioned were implemented in DT before growable vector happened in R.

@SebKrantz
Copy link
Member

SebKrantz commented Oct 24, 2024

Ok, yes I mean the manipulation of the visible and true vector length with SETLENGTH and SET_TRUELENGTH - you use that to overallocate data.tables and make them behave like normal data.frames at the R-level which is effectively creating a growable vector - the vector is fixed but the overallocated part is hidden. In chmatch() you use the same functions to use the CHARSXP table as a hash table. Anyway, it seems to me := wouldn't be possible in an API conform way without this length manipulation - thus my question about R-core engagement.

@aitap
Copy link
Contributor

aitap commented Oct 25, 2024

Over-allocated vectors (including lists since R-4.3.0) can be implemented using ALTREP in something like 60-100 lines of code (separate implementation for each primitive type, unfortunately). I expect it to be hard but not impossible to (1) refactor data.table's use of SETLENGTH, TRUELENGTH, SET_GROWABLE_BIT into an abstract "over-allocated growable vector" interface and (2) reimplement it in ALTREP #if R_VERSION >= R_Version(4,3,0).

Use of SET_TRUELENGTH in string matching functions as a free integer-sized field in every CHARSXP will be much harder to get rid of. I'm not seeing any other fields in a CHARSXP that could be used in API-compliant way for this purpose, which leaves hash tables, which have different performance characteristics. Also, the C standard doesn't guarantee that if (void*)a == (void*)b, (uintptr_t)a == (uintptr_t)b will also hold.

@SebKrantz
Copy link
Member

It's a nice idea, although it would significantly reduce backwards compatibility by one major version (data.table currently depends R >= 3.3.0) and I'm not sure what consequences ALTREP would have for the interopertability of data.table, particularly for package C code meant to manipulate standard data frames (and for storage as rds and reading if data.table is not loaded)? In any case, if maintainers are unsure how to proceed, perhaps invite R-core to the conversation (e.g. Thomas Kalibera (kalibera) or Luke Tierney (ltierney)). Seems also the vctrs team has not found a solution yet (r-lib/vctrs#1933) so it seems this is not a small issue (and we're talking about probably a third or quarter of CRAN packages that depends on data.table and vctrs, so it must be of concern to R-core that a good solution is found here)

@aitap
Copy link
Contributor

aitap commented Oct 27, 2024

Part of the idea is to keep both implementations and use the original one for R ∈ [3.3.0; 4.3.0) and the ALTREP one for R ≥ 4.3.0. Maintenance burden, yes, but the old code is already quite stable, and the internals of old R versions aren't going to change.

The C code, even if it requests raw data pointers, shouldn't notice anything amiss: since the ALTREP class already carries a full-blown over-allocated object of the same type inside it, it can return its (writeable) data pointer when requested.

Serialization will have similar properties to what we have right now: if an ALTREP class does not implement a Serialize method, R will serialize the contents of the object as if it was an ordinary value, requiring a setDT() when it's reloaded. If we really want to, we can implement Serialize later, so that the ALTREP objects would come out already over-allocated.

An ALTREP data.table could also take care of duplicating itself properly, obviating a whole class of bugs resulting from it being shallowly duplicated by attr<- and many other R operations.

@SebKrantz
Copy link
Member

Sounds Good. It would be ideal of course if other frameworks like collapse don't need to take additional actions to interoperate with data.table. I will stay updated with this (SebKrantz/collapse#653). Happy to assist with the implementation as well, although I haven't used ALTREP yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants