
feat: Add caching to generating the SQL in duckdb if the query stays the same #10803

Open
mercelino opened this issue Feb 6, 2025 · 12 comments
Labels
feature Features or general enhancements

Comments


mercelino commented Feb 6, 2025

Which new backend would you like to see in Ibis?

I would like Ibis to cache the step where an expression is compiled to a SQL query, for the case where the query stays the same across executions.

This is an example of why it's needed:

import random
import time
import statistics

import duckdb
import ibis
import polars as pl

n_rows = 10_000
df = pl.DataFrame({
    'a': random.sample(range(0, n_rows), 500) * 20,
    'b': random.sample(range(0, n_rows), n_rows),
    'c': random.sample(range(0, n_rows), n_rows),
    'd': random.sample(range(0, n_rows), n_rows)
})
df.write_parquet('test.parquet')
ls_time_duckdb = []
for _ in range(200):
    s = time.time()
    df_duckdb = duckdb.sql("""
    SELECT a,
           MEAN(b) AS b,
           MIN(c)  AS c,
           MAX(d)  AS d,
           MEAN(d) FILTER (WHERE d > 100) AS d_100,
           MEAN(d) FILTER (WHERE d > 200) AS d_200,
           MEAN(d) FILTER (WHERE d > 300) AS d_300
    FROM read_parquet('test.parquet')
    GROUP BY a
    """).pl()
    del df_duckdb
    e = time.time()
    ls_time_duckdb.append(e - s)
print(statistics.mean(ls_time_duckdb))
  • The output is: 0.00650534987449646 (it can vary slightly from run to run)
con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = (table
 .group_by('a')
 .agg(
    b=table.b.mean(),
    c=table.c.min(),
    d=table.d.max(),
    d_100=table.d.mean(where=table.d > 100),
    d_200=table.d.mean(where=table.d > 200),
    d_300=table.d.mean(where=table.d > 300),
))
ls_time_ibis = []
for _ in range(200):
    s = time.time()
    df_ibis = table_computed.to_polars()
    del df_ibis
    
    e = time.time()
    ls_time_ibis.append(e - s)
print(statistics.mean(ls_time_ibis))
  • The output is: 0.009841306209564209
  • As you can see, the Ibis version is slightly slower, by around 0.003 seconds per run. Executing the following shows where that time goes:
ls_time_ibis_sql = []
for _ in range(200):
    s = time.time()
    ibis.to_sql(table_computed)
    e = time.time()
    ls_time_ibis_sql.append(e - s)
print(statistics.mean(ls_time_ibis_sql))
  • The output is: 0.003720695972442627

So the slowdown appears to come from compiling the Ibis expression into DuckDB SQL: the ~0.003 s gap between Ibis and raw DuckDB closely matches the ~0.0037 s compilation time measured above.
That is not a problem in this example, because 0.003 seconds is negligible.
But this time grows with the size of the query, or in other words with the number of calculations we do.

In a project, I do a lot of heavy calculations with very big queries that I migrated from pure DuckDB SQL to Ibis, and the time for generating the SQL is 0.2 seconds, which is considerable.
It makes the testing pipeline a lot slower, and the production jobs as well: I have a queue system with over 15,000 tasks, and an extra 0.2 seconds on each of those tasks makes the overall job much slower.

The SQL compiled by Ibis doesn't change across any of these tasks, so I was wondering whether there is a way to cache it locally instead of compiling it every time.

My initial idea was to save the compiled query to a local file the first time the Ibis expression runs, and to execute the saved query directly with DuckDB on subsequent iterations. That isn't very clean, though, so I wanted to run the idea by the library's maintainers first, to see what you think and whether this could be handled in Ibis itself in the future.
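For what it's worth, a minimal sketch of that workaround, assuming everything runs on the same connection (read_parquet registers the file on that connection, so the compiled SQL stays valid there): compile once with ibis.to_sql, wrap the resulting string with con.sql, and execute the wrapper on each iteration. ibis.to_sql, con.sql, and to_polars are real APIs; the caching pattern itself is just the idea described above.

import ibis

con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = table.group_by('a').agg(b=table.b.mean())

# Compile the expensive expression exactly once.
cached_sql = str(ibis.to_sql(table_computed, dialect="duckdb"))

# Wrap the cached SQL in a trivial expression; executing it avoids
# recompiling the original expression tree on every iteration.
cached_expr = con.sql(cached_sql)

for _ in range(200):
    df_ibis = cached_expr.to_polars()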

Thank you.

Code of Conduct

  • I agree to follow this project's Code of Conduct
mercelino added the labels "feature" (Features or general enhancements) and "new backend" (PRs or issues related to adding new backends) on Feb 6, 2025
mercelino (Author) commented:

Apologies, I've filed the ticket under the 'new backend' label, but it isn't one. I can delete and recreate it if needed.

cpcloud removed the "new backend" (PRs or issues related to adding new backends) label on Feb 6, 2025

cpcloud commented Feb 6, 2025

Thanks for the issue! (no problem on the label, I removed it)

Have you seen the .cache() method?

Would that work for your use case?
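For reference, expr.cache() executes the expression once and materializes the result into a temporary table on the backend, so later reads skip both compilation and execution; a minimal sketch:

cached = table_computed.cache()  # runs the query once and stores the result
df = cached.to_polars()          # subsequent reads hit the cached table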

mercelino (Author) commented:

Yes, I've checked that method before, but unfortunately it caches the data as well, not only the SQL query.


cpcloud commented Feb 6, 2025

Before we discuss a solution (e.g., caching queries), I'd like to understand the problem better.

It seems odd, or perhaps rare (but entirely possible!), to be in a scenario where the query you're producing with Ibis is large enough for compilation time to be noticeable, yet the query itself runs fast enough that you can observe the compilation overhead.

Is that your scenario?

mercelino (Author) commented:

Yes, this is exactly the issue. The difference I see in production is around 0.2 seconds, which makes sense since, as you said, it's a very large query.


cpcloud commented Feb 6, 2025

Hm, to understand where the performance problem lies in compilation, I need a representative query. Is there any way for you to provide that?

mercelino (Author) commented:

I've attached the query I'm running, anonymized by changing the column names.

github_issue.zip


cpcloud commented Feb 10, 2025

After benchmarking this, it's actually the pattern system that is taking up the bulk of the time here.

I tried with #10078, and it doesn't help that much, which suggests that the problem is not a Python-interpreter issue, but perhaps an algorithmic one.

I'll dig around a bit more to see if I can pinpoint more precisely what is taking up time here.

cc @kszucs


kszucs commented Feb 10, 2025

@cpcloud let me know when you have a narrower scope available to investigate. Profiling might help surface the most time-consuming calls, which we could then dig into further.
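A minimal sketch of that kind of profiling, using only the standard library and the small expression from the original report (the real investigation would substitute the attached query):

import cProfile
import pstats

import ibis

con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = table.group_by('a').agg(b=table.b.mean())

# Profile repeated compilations and print the hottest call sites.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    ibis.to_sql(table_computed)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)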


cpcloud commented Feb 14, 2025

@kszucs I have a branch on my fork called bench-patterns that you can clone and then run python bench_compile.py to see a profiled run of the query submitted by @mercelino.


cpcloud commented Feb 14, 2025

I tried playing around with a few different things, but the biggest impact by far comes from disabling selection fusion: that alone saves about 40% of the compilation time.


kszucs commented Feb 17, 2025

Do you have a flame graph, perhaps? I assume we either do redundant work or simply need to rewrite too many nodes during fusion.
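One way to produce such a flame graph is py-spy, a third-party sampling profiler (not part of Ibis), pointed at the bench_compile.py script from the bench-patterns branch mentioned above:

pip install py-spy
py-spy record -o compile_flamegraph.svg -- python bench_compile.py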

Projects
Status: backlog