Turning off query planning is difficult #11070
Labels: needs triage (needs a response from a contributor)
Here's another example where I struggle to turn off query planning. I'm using plateau to load a dask dataframe from disk:

    import dask
    import pandas as pd
    from tempfile import TemporaryDirectory

    from plateau.io.dask.dataframe import read_dataset_as_ddf
    from plateau.io.eager import store_dataframes_as_dataset

    dataset_dir = TemporaryDirectory()
    store_url = f"hfs://{dataset_dir.name}"


    def create_plateau_dataset():
        store_dataframes_as_dataset(
            dfs=[pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5]})],
            dataset_uuid="abc123",
            store=store_url,
        )


    if __name__ == "__main__":
        create_plateau_dataset()
        with dask.config.set({"dataframe.query-planning": False}):
            ddf = read_dataset_as_ddf(dataset_uuid="abc123", store=store_url)
            print(hasattr(ddf, "_expr"))  # True

Here, I wouldn't know which modules to reload to get the desired behavior.
I mentioned this in #11067, but maybe this deserves its own issue: I find it difficult to turn off query planning using the Python API. Using `dask.config.set` only works if `dask.dataframe` hasn't been imported up until that point.

This works (query planner is turned off):

This doesn't work (query planner is turned on):
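
The difference between the two cases is import order. The following is a stdlib-only simulation of that mechanism, not dask's actual code: `fake_config`, `fake_frames`, `set_option`, and `BACKEND` are all hypothetical names, and the sketch assumes only that the library picks its implementation once, while its module body runs at import time.

```python
import importlib
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Two throwaway modules: a config store and a "library" that reads it on import.
pkg_dir = Path(tempfile.mkdtemp())
(pkg_dir / "fake_config.py").write_text("settings = {'query-planning': True}\n")
(pkg_dir / "fake_frames.py").write_text(
    "import fake_config\n"
    "# Decided once, when the module body runs.\n"
    "BACKEND = 'expr' if fake_config.settings['query-planning'] else 'legacy'\n"
)
sys.path.insert(0, str(pkg_dir))

import fake_config


@contextmanager
def set_option(key, value):
    """Temporarily override one setting and restore the old value on exit."""
    old = fake_config.settings[key]
    fake_config.settings[key] = value
    try:
        yield
    finally:
        fake_config.settings[key] = old


# Case 1: first import happens inside the context, so the override is seen.
with set_option("query-planning", False):
    import fake_frames
    backend_import_inside = fake_frames.BACKEND

# Reset to the default state, as if the user had imported the library earlier.
importlib.reload(fake_frames)

# Case 2: the module is already imported, so the override is ignored.
with set_option("query-planning", False):
    import fake_frames  # no-op: already cached in sys.modules
    backend_already_imported = fake_frames.BACKEND
    # Reloading re-runs the module body and picks up the override.
    backend_after_reload = importlib.reload(fake_frames).BACKEND

print(backend_import_inside, backend_already_imported, backend_after_reload)
# prints: legacy expr legacy
```

In this toy setup the override only takes effect when the module body runs while it is active, which is why the second case needs a reload.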
When I write library code, I typically can't control what users already imported.
Yes, adding `dd = importlib.reload(dd)` inside the context also fixes the issue in this example, but that doesn't work in all settings.

E.g., imagine that I write a library with a function that some user can call with a dask dataframe:
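
In that situation the library cannot influence how the dataframe was built, but it can at least detect what it was handed. A minimal sketch, assuming (as in the `hasattr(ddf, "_expr")` check used elsewhere in this thread) that expression-backed collections expose an `_expr` attribute. `uses_query_planning` and `process` are hypothetical names, and the two dummy classes merely stand in for real dask collections.

```python
def uses_query_planning(collection) -> bool:
    """Heuristic: treat anything with an `_expr` attribute as expression-backed."""
    return hasattr(collection, "_expr")


def process(ddf):
    """Hypothetical library entry point that accepts a user's dataframe."""
    if uses_query_planning(ddf):
        # The caller created this object, so the library can only report it.
        raise TypeError(
            "got an expression-backed dataframe; this function expects the "
            "legacy implementation"
        )
    return ddf


class LegacyFrame:
    """Stand-in for a legacy dask dataframe (no `_expr` attribute)."""


class ExprFrame:
    """Stand-in for a dataframe created with query planning enabled."""

    _expr = object()


legacy_ok = process(LegacyFrame()) is not None
try:
    process(ExprFrame())
    expr_rejected = False
except TypeError:
    expr_rejected = True
```

This only surfaces the problem early with a clear error; it does not turn the query planner off.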
Is there a clever way around this? I have drastic ideas like building a conda package that sets the `DASK_DATAFRAME__QUERY_PLANNING` environment variable, but that might be a bit much. I would much rather turn the query planner off selectively.

As an aside: maybe all of this is moot once #11067 is fixed.
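
For reference, the environment-variable name above follows dask's documented convention: variables starting with `DASK_` are lower-cased, double underscores become nesting, and values are parsed with `ast.literal_eval` where possible. The sketch below approximates that mapping with the standard library only. `dask_env_to_config` is a hypothetical helper, not part of dask, and it returns flat dotted keys rather than a nested config.

```python
import ast


def dask_env_to_config(env):
    """Approximate how DASK_* environment variables map to dotted config keys."""
    config = {}
    for name, raw in env.items():
        if not name.startswith("DASK_"):
            continue  # unrelated variables are ignored
        key = name[len("DASK_"):].lower().replace("__", ".")
        try:
            value = ast.literal_eval(raw)  # "False" -> False, "4" -> 4
        except (SyntaxError, ValueError):
            value = raw  # keep anything unparsable as a plain string
        config[key] = value
    return config


parsed = dask_env_to_config(
    {"DASK_DATAFRAME__QUERY_PLANNING": "False", "PATH": "/usr/bin"}
)
print(parsed)  # {'dataframe.query_planning': False}
```

dask's configuration treats underscores and hyphens in key names as equivalent, so `dataframe.query_planning` here corresponds to the `dataframe.query-planning` key used with `dask.config.set`.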