From 4a4a778e3e217ae73bd08d8f90562594c8da4284 Mon Sep 17 00:00:00 2001
From: hussainsultan
Date: Thu, 6 Feb 2025 12:09:27 -0500
Subject: [PATCH] docs: copy edit

---
 docs/posts/udf-rewriting/index.qmd | 56 ++++++++++--------------------
 1 file changed, 19 insertions(+), 37 deletions(-)

diff --git a/docs/posts/udf-rewriting/index.qmd b/docs/posts/udf-rewriting/index.qmd
index aab860e78de0..b0fedebe6990 100644
--- a/docs/posts/udf-rewriting/index.qmd
+++ b/docs/posts/udf-rewriting/index.qmd
@@ -12,18 +12,19 @@ image: images/tree-pruning.png
 
 ## Introduction
 
-In an ideal world, deploying machine learning models within SQL
-queries would be as simple as calling a built-in function. Unfortunately, many
-ML predictions live inside **User-Defined Functions (UDFs)** that traditional
-SQL planners can't modify, preventing optimizations like predicate pushdowns.
+In an ideal world, deploying machine learning models within SQL queries would
+be as simple as calling a built-in function. Unfortunately, many ML predictions
+live inside **User-Defined Functions (UDFs)** that traditional SQL planners
+can't modify, preventing optimizations like predicate pushdowns.
 
-This blog post will showcase how you can **prune decision tree models
-based on query filters** by dynamically rewriting your expression using
-**Ibis** and **quickgrove**, an experimental GBDT inference library built in
-Rust. We'll also show how [LetSQL](https://github.com/letsql/letsql) can
-simplify this pattern further and integrate seamlessly into your ML workflows.
+This blog post will showcase how you can **prune decision tree models based on
+query filters** by dynamically rewriting your expression using **Ibis** and
+**quickgrove**, an experimental
+[GBDT](https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt)
+inference library built in Rust. We'll also show how
+[LetSQL](https://github.com/letsql/letsql) can simplify this pattern further
+and integrate seamlessly into your ML workflows.
 
----
 
 ## ML models meet SQL
 
@@ -44,19 +45,10 @@ tree paths, are evaluated for every row.
 With tree-based models, entire branches
 might never be evaluated at all — so the ideal scenario is to prune those
 unnecessary branches *before* evaluating them.
 
-### Why It Matters
-
-- **Flexible:** Ibis make transparent and easy to manipulate and rewrite its' IR.
-- **Simple**: Add advanced techniques e.g., predicate pushdowns in your pipeline
-without having to dive into database internals
-- **Performant**: For large datasets (hundreds of millions of rows or more),
-these optimizations add up quickly.
-
----
 
 ## Smart UDFs with Ibis
 
-**Ibis** is known for letting you write engine agnostic deferred expressions in
+**Ibis** is known for letting you write engine-agnostic deferred expressions in
 Python without losing the power of underlying engines like Spark, DuckDB, or
 BigQuery. Meanwhile, quickgrove provides a mechanism to prune Gradient Boosted
 Decision Tree (GBDT) models based on known filter conditions.
@@ -65,7 +57,7 @@ Decision Tree (GBDT) models based on known filter conditions.
 
 1. **Prune decision trees** by removing branches that can never be reached,
 given the known filters
-2. **Rewrite expressions** with pruned model into the query plan to skip
+2. **Rewrite expressions** with the pruned model into the query plan to skip
 unnecessary computations
 
 ### Understanding tree pruning
@@ -83,9 +75,8 @@ optimizer](https://arxiv.org/pdf/2206.00136) paper. It demonstrates how you
 can prune nodes in query plans for tree-based inference, so we’re taking a
 similar approach here for **forests** (GBDTs) using **Ibis.**
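+
+To make the idea concrete, here is a minimal, hypothetical sketch in plain
+Python (toy nested dicts rather than quickgrove's actual data structures) of
+how a known filter such as `color_i < 0.2` makes one side of a split
+unreachable:
+
+```python
+# Hypothetical illustration only: quickgrove does the equivalent work in Rust
+# over real GBDT trees; here a tree is just a nested dict.
+def prune(node, upper_bounds):
+    """Drop branches ruled out by known upper bounds, e.g. {"color_i": 0.2}."""
+    if "leaf" in node:  # leaves have nothing left to prune
+        return node
+    feature, threshold = node["feature"], node["threshold"]
+    bound = upper_bounds.get(feature)
+    if bound is not None and bound <= threshold:
+        # The filter guarantees feature < threshold, so only the
+        # left (feature < threshold) branch is reachable.
+        return prune(node["left"], upper_bounds)
+    return {
+        "feature": feature,
+        "threshold": threshold,
+        "left": prune(node["left"], upper_bounds),
+        "right": prune(node["right"], upper_bounds),
+    }
+
+tree = {
+    "feature": "color_i",
+    "threshold": 0.5,
+    "left": {"leaf": 0.1},
+    "right": {"feature": "clarity_vvs2", "threshold": 0.5,
+              "left": {"leaf": 0.3}, "right": {"leaf": 0.7}},
+}
+print(prune(tree, {"color_i": 0.2}))  # only {"leaf": 0.1} remains reachable
+```
+
+The same reasoning, applied to every tree in the forest, is what quickgrove's
+`prune` API does with the filter predicates we extract from the Ibis
+expression later in this post.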
 
----
 
-### Quickgrove: prune-able GBDT models
+### Quickgrove: prunable GBDT models
 
 Quickgrove is an experimental package that can load GBDT JSON models and
 provides a `.prune(...)` API to remove unreachable branches. For example:
@@ -101,7 +92,6 @@ model.prune([quickgrove.Feature("color_i") < 0.2]) # Prune based on known predic
 
 Once pruned, the model is leaner to evaluate. Note: The results heavily depend
 on model splits and interactions with predicate pushdowns.
 
----
 
 ## Scalar PyArrow UDFs in Ibis
 
@@ -130,10 +120,9 @@ def predict_gbdt(
     return model.predict_arrays(array_list)
 ```
 
-Currently, udfs are opaque to Ibis. We need Ibis to teach Ibis how to rewrite a
+Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
-udf based on predicates it knows about.
+UDF based on predicates it knows about.
 
----
 
 ## Making Ibis UDFs predicate-aware
 
@@ -273,8 +262,8 @@ expr = (
 See the diff below:
 
 Notice that with pruning we can drop some of the projections in
-the UDF i.e. `color_i`, `color_j` and `clarity_vvs2`. The underlying engine
-.e.g. DataFusion may optimize this further when pulling data for UDFs. We
+the UDF, e.g., `color_i`, `color_j`, and `clarity_vvs2`. The underlying engine
+(e.g., DataFusion) may optimize this further when pulling data for UDFs. We
 cannot completely drop these from the query expression.
 
 ```shell
@@ -290,8 +279,6 @@ cannot completely drop these from the query expression.
 )
 ```
 
----
 
 ## Putting it all together
 
 The complete example can be found [here](https://github.com/letsql/trusty/blob/main/python/examples/ibis_filter_condition.py).
@@ -321,7 +308,6 @@ When this is done, the model inside `predict_gbdt` will be **pruned** based on
 the expression's filter conditions. This can yield significant speedups on
 large datasets (see @tbl-perf).
 
----
 
 ## Performance impact
 
@@ -350,7 +336,6 @@ Benchmark results:
 translate to real compute savings, albeit heavily dependent on how pertinent
 the filter conditions might be.
 
----
 
 ## LetSQL: simplifying UDF rewriting
 
@@ -380,7 +365,6 @@ With LetSQL, you get a **shorter, more declarative approach** to the same
 optimization logic we manually coded with Ibis. It abstracts away the gritty
 parts of rewriting your query plan.
 
----
 
 ## Best practices & considerations
 
@@ -388,7 +372,7 @@ parts of rewriting your query plan.
 can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits.
 - **Quickgrove** only supports a handful of objective functions and most notably
 does not have categorical support yet. In theory, categorical variables
-make a better candidates for pruning based on filter conditions. It only
+make better candidates for pruning based on filter conditions. It only
 supports XGBoost format.
 - **Model Format**: XGBoost JSON is straightforward to parse. Other formats
 (e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
 need more robust parsing.
 same columns your trees split on. For purely adhoc queries or rarely used
 filters, the overhead of rewriting might outweigh the benefit.
 
----
 
 ## Conclusion
 
 Combining **Ibis** with a prune-friendly framework like quickgrove lets you
 optimize large-scale ML inference inside ML workflows. By **pushing filter
 predicates down into your decision trees**, you speed up queries significantly.
-**And with LetSQL**, you can streamline this entire process—especially if you’re +With LetSQL, you can streamline this entire process—especially if you’re looking for an out-of-the-box solution that integrates with multiple engines along with batteries included features like caching and aggregate/window UDFs. For the next steps, consider experimenting with more complex models, exploring @@ -432,4 +415,3 @@ Queries](https://arxiv.org/pdf/2206.00136) Post](https://ibis-project.org/posts/torch/) - [Multi-Engine Data Stack with Ibis](https://www.letsql.com/posts/multi-engine-data-stack-ibis/) -- **LetSQL**: [Documentation](https://docs.letsql.com)