
feat: Add caching to generating the SQL in duckdb if the query stays the same #10803

Open
mercelino opened this issue Feb 6, 2025 · 12 comments
Labels
feature Features or general enhancements

Comments


mercelino commented Feb 6, 2025

Which new backend would you like to see in Ibis?

I would like Ibis to cache the step where an expression is compiled to a SQL query, for the case where the query stays the same across executions.

This is an example of why it's needed:

import random
import time
import statistics

import duckdb
import ibis
import polars as pl

n_rows = 10_000
df = pl.DataFrame({
    'a': random.sample(range(0, n_rows), 500) * 20,
    'b': random.sample(range(0, n_rows), n_rows),
    'c': random.sample(range(0, n_rows), n_rows),
    'd': random.sample(range(0, n_rows), n_rows)
})
df.write_parquet('test.parquet')
ls_time_duckdb = []
for _ in range(200):
    s = time.time()
    df_duckdb = duckdb.sql("""
    SELECT a,
           MEAN(b) AS b,
           MIN(c)  AS c,
           MAX(d)  AS d,
           MEAN(d) FILTER (WHERE d > 100) AS d_100,
           MEAN(d) FILTER (WHERE d > 200) AS d_200,
           MEAN(d) FILTER (WHERE d > 300) AS d_300
    FROM read_parquet('test.parquet')
    GROUP BY a
    """).pl()
    del df_duckdb
    e = time.time()
    ls_time_duckdb.append(e - s)
print(statistics.mean(ls_time_duckdb))
  • The output is: 0.00650534987449646 (it can vary slightly from run to run)
con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = (table
 .group_by('a')
 .agg(
    b=table.b.mean(),
    c=table.c.min(),
    d=table.d.max(),
    d_100=table.d.mean(where=table.d > 100),
    d_200=table.d.mean(where=table.d > 200),
    d_300=table.d.mean(where=table.d > 300),
))
ls_time_ibis = []
for _ in range(200):
    s = time.time()
    df_ibis = table_computed.to_polars()
    del df_ibis
    
    e = time.time()
    ls_time_ibis.append(e - s)
print(statistics.mean(ls_time_ibis))
  • The output is: 0.009841306209564209
  • As you can see, the Ibis version is slightly slower, by around 0.003 seconds per run. Executing the following shows where that time goes:
ls_time_ibis_sql = []
for _ in range(200):
    s = time.time()
    ibis.to_sql(table_computed)
    e = time.time()
    ls_time_ibis_sql.append(e - s)
print(statistics.mean(ls_time_ibis_sql))
  • The output is: 0.003720695972442627

So the slowdown appears to come from compiling the Ibis expression into DuckDB SQL: the ~0.003 s gap between Ibis and raw DuckDB closely matches the ~0.0037 s compilation time measured above.
That is not a problem in this example, because 0.003 seconds is negligible.
But this time grows with the size of the query, or in other words with the number of calculations we do.

In a project, I do a lot of heavy calculations with very big queries that I migrated from pure DuckDB SQL to Ibis, and the time for generating the SQL is 0.2 seconds, which is considerable.
It makes the testing pipeline a lot slower, and the production jobs as well: I have a queue system with over 15,000 tasks, and an extra 0.2 seconds on each of those tasks makes the overall job much slower.

The SQL compiled by Ibis doesn't change across any of these tasks, so I was wondering whether there is a way to cache it locally instead of compiling it every time.

My initial idea was to save the compiled query to a local file the first time the Ibis expression runs, and to execute the saved query directly with DuckDB on subsequent iterations. That isn't very clean, though, so I wanted to run the idea by the library's maintainers first, to see what you think and whether this could be handled in Ibis itself in the future.
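For what it's worth, a minimal sketch of that workaround, assuming everything runs on the same connection (read_parquet registers the file on that connection, so the compiled SQL stays valid there): compile once with ibis.to_sql, wrap the resulting string with con.sql, and execute the wrapper on each iteration. ibis.to_sql, con.sql, and to_polars are real APIs; the caching pattern itself is just the idea described above.

import ibis

con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = table.group_by('a').agg(b=table.b.mean())

# Compile the expensive expression exactly once.
cached_sql = str(ibis.to_sql(table_computed, dialect="duckdb"))

# Wrap the cached SQL in a trivial expression; executing it avoids
# recompiling the original expression tree on every iteration.
cached_expr = con.sql(cached_sql)

for _ in range(200):
    df_ibis = cached_expr.to_polars()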

Thank you.

Code of Conduct

  • I agree to follow this project's Code of Conduct
mercelino added the labels "feature" (Features or general enhancements) and "new backend" (PRs or issues related to adding new backends) on Feb 6, 2025
mercelino (Author) commented:

Apologies, I've filed the ticket under the 'new backend' label, but it isn't one. I can delete and recreate it if needed.

cpcloud removed the "new backend" (PRs or issues related to adding new backends) label on Feb 6, 2025

cpcloud commented Feb 6, 2025

Thanks for the issue! (no problem on the label, I removed it)

Have you seen the .cache() method?

Would that work for your use case?
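For reference, expr.cache() executes the expression once and materializes the result into a temporary table on the backend, so later reads skip both compilation and execution; a minimal sketch:

cached = table_computed.cache()  # runs the query once and stores the result
df = cached.to_polars()          # subsequent reads hit the cached table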

mercelino (Author) commented:

Yes, I've checked that method before, but unfortunately it caches the data as well, not only the SQL query.


cpcloud commented Feb 6, 2025

Before we discuss a solution (e.g., caching queries), I'd like to understand the problem better.

It seems odd, or perhaps rare (but entirely possible!), to be in a scenario where the query you're producing with Ibis is large enough for compilation time to be noticeable, yet the query itself runs fast enough that you can observe the compilation overhead.

Is that your scenario?

mercelino (Author) commented:

Yes, this is exactly the issue. The difference I see in production is around 0.2 seconds, which makes sense since, as you said, it's a very large query.


cpcloud commented Feb 6, 2025

Hm, to understand where the performance problem lies in compilation, I need a representative query. Is there any way for you to provide that?

mercelino (Author) commented:

I've attached the query I'm running, anonymized by changing the column names.

github_issue.zip


cpcloud commented Feb 10, 2025

After benchmarking this, it's actually the pattern system that is taking up the bulk of the time here.

I tried with #10078, and it doesn't help that much, which suggests that the problem is not a Python-interpreter issue, but perhaps an algorithmic one.

I'll dig around a bit more to see if I can pinpoint more precisely what is taking up time here.

cc @kszucs


kszucs commented Feb 10, 2025

@cpcloud let me know when you have a narrower scope available to investigate. Profiling might help surface the most time-consuming calls, which we could then dig into further.
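A minimal sketch of that kind of profiling, using only the standard library and the small expression from the original report (the real investigation would substitute the attached query):

import cProfile
import pstats

import ibis

con = ibis.connect("duckdb://")
table = con.read_parquet('test.parquet')
table_computed = table.group_by('a').agg(b=table.b.mean())

# Profile repeated compilations and print the hottest call sites.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    ibis.to_sql(table_computed)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)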


cpcloud commented Feb 14, 2025

@kszucs I have a branch on my fork called bench-patterns that you can clone and then run python bench_compile.py to see a profiled run of the query submitted by @mercelino.


cpcloud commented Feb 14, 2025

I tried playing around with a few different things, but the biggest impact by far comes from disabling selection fusion: that alone saves about 40% of the compilation time.


kszucs commented Feb 17, 2025

Do you have a flame graph, perhaps? I assume we either do redundant work or simply need to rewrite too many nodes during fusion.
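One way to produce such a flame graph is py-spy, a third-party sampling profiler (not part of Ibis), pointed at the bench_compile.py script from the bench-patterns branch mentioned above:

pip install py-spy
py-spy record -o compile_flamegraph.svg -- python bench_compile.py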

Projects
Status: backlog