Control order of execution with query planner #11067
Labels: needs triage (needs a response from a contributor)

Comments
Hi, thanks for your report. This is definitely a bug; this should behave the same as in the previous version. I'll put up a solution.

I migrated
Describe the issue:
We often use dask together with plateau to perform the following task in parallel: load a dataset, transform it, and store the result. We typically do this with datasets that don't fit into our machine's RAM. This used to work fine, but with the new query planner we typically run out of memory, because dask now loads all the data before applying the transformation. Before, dask would operate chunk by chunk.
Minimal Complete Verifiable Example:
I tried to come up with a simple example that doesn't involve plateau (or a lot of data):
The behavior without the query planner is:
So we load, transform, and store, and we won't run out of memory.

With the query planner, we first load everything, so we run out of memory.
Should we be using `map_partitions` for something like this? Setting `DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION=1` doesn't help.

Anything else that we should know?
Note that it's really tricky to turn off the query planner (other than through environment variables). When using `dask.config` from within Python, it's important that the config is set before the first import of `dask.dataframe`, which is difficult to control.

Environment:
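For reference, here is a sketch of setting that option from Python, assuming the `dataframe.query-planning` config key used by recent dask releases; as noted above, it only takes effect if it runs before `dask.dataframe` is first imported anywhere in the process:

```python
import dask

# Must run before the first `import dask.dataframe` in this process.
# In releases where the legacy implementation has been removed, this
# flag may no longer have any effect.
dask.config.set({"dataframe.query-planning": False})

print(dask.config.get("dataframe.query-planning"))
```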
Thank you!