Writing to parquet does not release memory #550
-
When writing to parquet inside of a parallel process, the memory used is never released, regardless of explicit use of …. Here's a reproducible example:
I suspect this is because the C++ library is using threads internally. In my use case I often run up against this: the memory just "hangs on" over and over as the overall process runs, eventually leading to an OOM kill. Using …
-
If you have the option to choose, maybe you're better off using `plan(future.callr::callr, ...)`, which uses a temporary, independent R process for each future that is shut down after the future completes. The downside is more overhead. I guess the arrow folks are better suited to answer the question about memory not being released. If there's memory creep in parallel workers, I suspect it'll also happen in sequential mode - it'll just take longer to notice there, since it's a single process whose memory is growing instead of multiple.
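For concreteness, the `future.callr::callr` suggestion might look something like the sketch below. This is an untested illustration, not the original poster's code: the data frame, file names, and `future_lapply` loop are all made up for the example.

```r
# Sketch: run each future in a disposable callr-backed R process, so any
# memory held by arrow's internal C++ threads is freed when the worker exits.
# Assumes the future, future.apply, future.callr, and arrow packages are
# installed; the data and file names are illustrative.
library(future.apply)
library(arrow)

# One fresh R process per future, torn down after the future completes.
plan(future.callr::callr)

future_lapply(1:4, function(i) {
  df <- data.frame(x = rnorm(1e5))
  arrow::write_parquet(df, sprintf("part-%d.parquet", i))
  NULL  # avoid shipping large objects back to the main session
})
```

The trade-off is process startup overhead on every future, but in exchange no worker process lives long enough to accumulate unreleased memory.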
-
Also, I was intrigued by the apparent memory leak, so I played around with this some. If I replace …, … So it's not clear to me that this is because of something …