feat: Add caching to generating the SQL in duckdb if the query stays the same #10803
Apologies, I've filed the ticket under the 'new backend' label, but it isn't one. I can delete and recreate it if needed.
Thanks for the issue! (No problem about the label; I removed it.) Have you seen the existing caching method? Would that work for your use case?
Yes, I've looked at that method before, but unfortunately it caches the data as well, not just the SQL query.
Before we discuss a solution (e.g., caching queries), I'd like to understand the problem better. It seems odd, or perhaps rare (but entirely possible!), that you're in a scenario where the query you're producing with Ibis is large enough to see the effect of compilation, but when you run the query, it's so fast that you can observe the compilation overhead. Is that your scenario?
Yes, this is exactly the issue. The difference I see in production is around 0.2 seconds, which makes sense since, as you said, it's a very large query.
Hm, to understand where the performance problem lies in compilation, I need a representative query. Is there any way for you to provide that?
I've attached the query I'm running. I've anonymized it by changing the column names.
After benchmarking this, it's actually the pattern system that is taking up the bulk of the time here. I tried with #10078, and it doesn't help that much, which suggests that the problem is not a Python-interpreter issue, but perhaps an algorithmic one. I'll dig around a bit more to see if I can pinpoint more precisely what is taking up time here. cc @kszucs
@cpcloud let me know when you have a narrower scope available to investigate. Profiling might help to see the most time-consuming calls, which we could then investigate further.
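The profiling suggested above could be sketched with the standard library's `cProfile`. This is a generic helper, not anything Ibis-specific; `profile_call` is a hypothetical name, and the `ibis.to_sql` usage in the comment assumes Ibis is installed:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    # Report the ten most time-consuming calls by cumulative time.
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()

# Usage sketch: wrap the compile step, e.g.
#   sql, report = profile_call(ibis.to_sql, expr)
# and inspect `report` to see which internal calls dominate.
```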
@kszucs I have a branch on my fork called |
I tried playing around with a few different things but the biggest impact by far is disabling selection fusion. It saves about 40% just to disable fusion. |
Do you have a flame graph, perhaps? I assume we do redundant work, or we simply need to rewrite too many nodes during fusion.
Which new backend would you like to see in Ibis?
I would like caching of the step where the Ibis code is compiled to a SQL query, for the case where the query stays the same.
This is an example of why it's needed:
So I assume it's normal for it to be slow, since it takes time to compile the Python code into DuckDB SQL.
This is not a problem in the example, because 0.003 seconds is negligible.
But this time increases with the size of the query, or in other words with the amount of calculations that we do.
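One way to see the compile cost in isolation is to time the compile step separately from execution. A minimal sketch using only the standard library (`timed` is a hypothetical helper name; the commented usage assumes `ibis` and `duckdb` are installed and that `con` is a DuckDB backend connection):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage sketch:
#   sql, compile_seconds = timed(lambda: str(ibis.to_sql(expr)))
#   _, execute_seconds = timed(con.raw_sql, sql)
# Comparing the two shows how much of the total is pure compilation.
```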
In a project I do a lot of heavy calculations with very big queries that I migrated from pure DuckDB SQL to Ibis, and generating the SQL takes 0.2 seconds, which is considerable.
It makes the testing pipeline a lot slower, and the production jobs as well: I have a queue system with over 15 000 tasks, and 0.2 seconds on each of those tasks adds roughly 3 000 seconds (about 50 minutes) to the overall job.
The SQL code compiled by Ibis doesn't change across any of the tasks, so I was wondering whether there is a way to cache it locally instead of compiling it every time.
My initial thought was to save the query to a local file the first time Ibis runs, and then execute the query from that file with DuckDB on subsequent iterations. That isn't clean, though, so I wanted to run the idea past you, the library's team, first, to see what you think and whether this could be handled in the future.
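The file-based idea above could be sketched as a small helper that keys the cached SQL file by a hash of a caller-chosen key. All names here, including `cached_sql`, are hypothetical; this is not an Ibis API, just a sketch of the workaround:

```python
import hashlib
from pathlib import Path

def cached_sql(cache_dir: str, key: str, compile_fn) -> str:
    """Return the compiled SQL for key, invoking compile_fn only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / (hashlib.sha256(key.encode()).hexdigest() + ".sql")
    if path.exists():
        return path.read_text()   # cache hit: reuse the saved SQL
    sql = compile_fn()            # cache miss: compile once and save
    path.write_text(sql)
    return sql

# Usage sketch (assuming ibis is installed and expr is an Ibis expression):
#   sql = cached_sql(".sql_cache", "daily_report_v1",
#                    lambda: str(ibis.to_sql(expr)))
```

Note that the key must be bumped whenever the query logic changes, otherwise stale SQL is served; that invalidation problem is the main reason this is cleaner as a library feature.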
Thank you.