Add LazyFrame.branch() #21645

Open
mcrumiller opened this issue Mar 7, 2025 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

mcrumiller (Contributor) commented Mar 7, 2025

Description

I often have the case where I have multiple frames that all come from a common ancestor query path, but branch off near the end to materialize the final frame. Something like this:

lf = pl.LazyFrame(...long complex path...)

# Without this .collect(), all the prior work would be computed twice
lf = lf.collect().lazy()

lf1 = lf.with_columns(...a few operations...).sink_csv("out1.csv")
lf2 = lf.with_columns(...different operations...).sink_csv("out2.csv")

Without the lf.collect().lazy(), all of the work in the long complex path would be computed twice, because each sink executes its own copy of the entire query plan. An lf.branch() would solve this without the mid-way materialization:

lf = pl.LazyFrame(...long complex path...)

# lf2 would not need to recompute the work done prior to the branch
lf1 = lf.branch().with_columns(...a few operations...).sink_csv("out1.csv")
lf2 = lf.branch().with_columns(...different operations...).sink_csv("out2.csv")
ion-elgreco (Contributor) commented

But lf1 and lf2 are not executed simultaneously, so you would still have to keep lf in memory, because you don't know when lf2 is going to be executed.

Maybe something like pl.execute_all(plan, plan) that finds common subnodes and rewrites them to use a single subplan, which has N outputs streamed to the downstream nodes. Then it's clearer when the lf stream is finished.

ritchie46 (Member) commented

We already have pl.collect_all. I want to make a global CSE and a cache that can be shared. Got plans for that in the future.
