
serializedSize(): Does it allocate memory? #126

Open · HenrikBengtsson opened this issue Feb 7, 2025 · 2 comments

@HenrikBengtsson (Collaborator) commented Feb 7, 2025

Background

Over at futureverse/future#760, @fproske reports that serializedSize() consumes a lot of memory. They detected this because their containers/VMs were getting killed by OOM after upgrading to a future version that relies on serializedSize().

I think they used the profvis package to show roughly 100 MB of memory being allocated by serializedSize(). Indeed, if I run something like:

prof <- profvis::profvis({
  for (kk in 1:1e6) parallelly::serializedSize(NULL)
})

I see lots of memory being reported, e.g.

Code                                                 File    Memory (MB) [deallocated / allocated]
parallelly::serializedSize                           <expr>  -4246.6 / 4492.0
for (kk in 1:1e6) parallelly::serializedSize(NULL)   <expr>   -445.3 /  221.1

but that looks odd to me.

Troubleshooting

I'm not sure how this happens, but it could be that the internal serialization code of R that we rely on materializes each intermediate object, which we never make use of - we are only interested in the byte counts. Our code is in https://github.com/futureverse/parallelly/blob/develop/src/calc-serialized-size.c.
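For comparison, here is a quick way to see the difference at the R level between counting bytes and materializing the serialized payload (a minimal sketch; serialize() with connection = NULL returns the full payload as a raw vector, whereas serializedSize() is only supposed to count):

x <- rnorm(1e6)  # ~8 MB of doubles

## serialize() to NULL materializes the entire payload as a raw vector
raw_bytes <- serialize(x, connection = NULL)
length(raw_bytes)                # number of serialized bytes; ~8 MB allocated

## serializedSize() should return the same byte count without building the payload
parallelly::serializedSize(x)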

It could also be that something else is going on here. To better inspect the memory allocations, I went low-level with base::Rprof(), which profvis uses internally. With this, I get:

library(parallelly)
R <- 1e7  # number of replications; results below are for R = 1e5, 1e6, and 1e7

ns <- c(0, 1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7)
data <- data.frame(n = ns, size = double(length(ns)), bytes_per_call = double(length(ns)))

for (kk in seq_len(nrow(data))) {
  n <- data$n[kk]
  x <- rnorm(n)
  size <- object.size(x)
  message(sprintf("Object size: %.0f bytes", size))
  data[kk, "size"] <- size

  Rprof(memory.profiling = TRUE)
  for (rr in 1:R) { serializedSize(x) }
  Rprof(NULL)
  prof <- summaryRprof(memory = "both")
  mem_avg <- prof$by.total[['"serializedSize"', "mem.total"]] * 1024^2 / R
  data[kk, "bytes_per_call"] <- mem_avg
}

print(data)

With R = 1e5, I get:

      n     size bytes_per_call
1 0e+00       48       907.0182
2 1e+00       56       717.2260
3 1e+02      848       959.4470
4 1e+03     8048       761.2662
5 1e+04    80048      2690.6460
6 1e+05   800048      2552.2340
7 1e+06  8000048      2403.3362
8 1e+07 80000048      2819.6209

With R = 1e6, I get:

      n     size bytes_per_call
1 0e+00       48      3794.3771
2 1e+00       56      4143.5529
3 1e+02      848      2741.6068
4 1e+03     8048      2570.6889
5 1e+04    80048      1286.8125
6 1e+05   800048       298.8442
7 1e+06  8000048       294.0207
8 1e+07 80000048      2794.4550

With R = 1e7, I get:

      n     size bytes_per_call
1 0e+00       48       1587.072
2 1e+00       56       1433.246
3 1e+02      848       1556.359
4 1e+03     8048       1615.164
5 1e+04    80048       2918.145
6 1e+05   800048       1345.313
7 1e+06  8000048       1154.765
8 1e+07 80000048       5260.957

I'm not sure what to make of this, because this says that only 2-5 kB is allocated per serializedSize() call, regardless of the size of the object being measured.
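As a cross-check of these Rprof() numbers, one could also record individual allocations with the profmem package, which uses utils::Rprofmem() under the hood (a sketch, assuming profmem is installed):

library(profmem)

x <- rnorm(1e6)  # ~8 MB object
p <- profmem({
  for (kk in 1:1000) parallelly::serializedSize(x)
})
total(p)  # total bytes allocated across all 1000 calls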

@coolbutuseless, as an expert on serialization and the one who came up with serializedSize(), do you know if the internals materialize the different objects as they are being serialized? If so, do you know if the R API allows us to avoid that? For instance, if I use:

con <- file(nullfile(), open = "wb")
void <- serialize(x, connection = con)
close(con)

I think the objects being serialized are immediately streamed to the null file, avoiding any materialization in memory. I wonder if that strategy could be used in serializedSize().
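To make that concrete, here is a hypothetical helper - not part of parallelly - that streams to a temporary file instead of the null device, so the byte count can be read back with file.size():

serialized_size_via_file <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  con <- file(tmp, open = "wb")
  serialize(x, connection = con)  # streamed to disk; no in-memory payload
  close(con)
  file.size(tmp)
}

serialized_size_via_file(rnorm(1e4))  # byte count, at the cost of disk I/O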

@coolbutuseless commented Feb 9, 2025

Hi @HenrikBengtsson,

My understanding of R internals is that there is no materialization happening during the serialization process.

Within R internals, it walks the object being serialized and passes the pointers to the members of that object (and a length) to your specified callback (i.e. count_bytes()). It is not recreating the objects, nor allocating any space to keep them.

Your memory usage calcs with Rprof agree with my understanding of what is happening - no more than a few kB to serialize data - and totally independent of the size of the object being serialized. There's just a minimal number of allocations - probably associated with bookkeeping that R is doing internally while it's walking the object.
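A rough way to sanity-check this from R (a quick demonstration, not a precise measurement): reset gc()'s "max used" statistics, call serializedSize() on a large object, and confirm that "max used" barely moves:

x_big <- rnorm(1e7)  # ~80 MB of doubles

gc(reset = TRUE)                              # reset the "max used" statistics
invisible(parallelly::serializedSize(x_big))
gc()                                          # "max used" should be ~unchanged,
                                              # i.e. no ~80 MB copy was created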

@HenrikBengtsson (Collaborator, Author) commented

@coolbutuseless, thank you so much for your insight.

Within R internals, it walks the object being serialized and passes the pointers to the members of that object (and a length) ...

The "passes the pointers" is exactly what I was hoping for. Excellent.

So, it remains to be understood:

  1. why profvis::profvis() reports such different amounts of memory allocation compared to Rprof(), despite using Rprof() internally as well, and

  2. why OOM kicks in over at futureverse/future#760 ("Use of serializedSize uses significant memory") - hopefully it's just that memory consumption increased slightly after upgrading future, but enough to push it above the OOM threshold.
