
## Introduction

In an ideal world, deploying machine learning models within SQL queries would
be as simple as calling a built-in function. Unfortunately, many ML predictions
live inside **User-Defined Functions (UDFs)** that traditional SQL planners
can't modify, preventing optimizations like predicate pushdowns.

This blog post will showcase how you can **prune decision tree models based on
query filters** by dynamically rewriting your expression using **Ibis** and
**quickgrove**, an experimental
[GBDT](https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt)
inference library built in Rust. We'll also show how
[LetSQL](https://github.com/letsql/letsql) can simplify this pattern further
and integrate seamlessly into your ML workflows.

---

## ML models meet SQL

Because the UDF is a black box to the planner, all of the model's computations,
including unreachable tree paths, are evaluated for every row. With tree-based models, entire
branches might never be evaluated at all — so the ideal scenario is to prune
those unnecessary branches *before* evaluating them.

### Why it matters

- **Flexible**: Ibis makes its IR transparent and easy to manipulate and rewrite.
- **Simple**: Add advanced techniques (e.g., predicate pushdowns) to your pipeline
  without having to dive into database internals.
- **Performant**: For large datasets (hundreds of millions of rows or more),
  these optimizations add up quickly.

---

## Smart UDFs with Ibis

**Ibis** is known for letting you write engine-agnostic deferred expressions in
Python without losing the power of underlying engines like Spark, DuckDB, or
BigQuery. Meanwhile, quickgrove provides a mechanism to prune Gradient Boosted
Decision Tree (GBDT) models based on known filter conditions.
Putting the two together, we can:

1. **Prune decision trees** by removing branches that can never be reached,
given the known filters
2. **Rewrite expressions** with the pruned model into the query plan to skip
unnecessary computations

### Understanding tree pruning
This approach is inspired by the [end-to-end prediction query
optimizer](https://arxiv.org/pdf/2206.00136) paper. It demonstrates how you can
prune nodes in query plans for tree-based inference, so we’re taking a similar
approach here for **forests** (GBDTs) using **Ibis**.
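
To make this concrete, here is a toy sketch of branch pruning on a hand-rolled
regression tree. It is illustrative only (quickgrove implements this over real
GBDT forests): given a filter like `color_i < 0.2`, any branch that requires
`color_i >= threshold` with `threshold >= 0.2` can never be taken, so we
replace the split with its reachable subtree.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A toy regression-tree node: internal nodes split on `feature < threshold`."""
    feature: str | None = None
    threshold: float | None = None
    left: "Node | None" = None    # taken when feature < threshold
    right: "Node | None" = None   # taken when feature >= threshold
    value: float | None = None    # set on leaves


def prune(node: Node, feature: str, upper_bound: float) -> Node:
    """Remove branches unreachable under the filter `feature < upper_bound`."""
    if node.value is not None:  # leaf: nothing to prune
        return node
    if node.feature == feature and upper_bound <= node.threshold:
        # Every surviving row has feature < upper_bound <= threshold,
        # so the right branch can never be taken: keep only the left.
        return prune(node.left, feature, upper_bound)
    return Node(
        feature=node.feature,
        threshold=node.threshold,
        left=prune(node.left, feature, upper_bound),
        right=prune(node.right, feature, upper_bound),
    )
```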


### Quickgrove: prunable GBDT models

Quickgrove is an experimental package that can load GBDT JSON models and
provides a `.prune(...)` API to remove unreachable branches. For example:
```python
import quickgrove

model = quickgrove.json_load("model.json")  # loading call assumed; see the quickgrove docs
model.prune([quickgrove.Feature("color_i") < 0.2])  # Prune based on known predicate
```
Once pruned, the model is leaner to evaluate. Note: The results heavily depend on
model splits and interactions with predicate pushdowns.


## Scalar PyArrow UDFs in Ibis

Ibis lets you register scalar UDFs that operate on PyArrow arrays. A minimal
sketch of the predict UDF (the real signature takes one argument per feature
column, and `model` is the quickgrove model loaded above):

```python
import ibis


@ibis.udf.scalar.pyarrow
def predict_gbdt(
    carat: float,
    depth: float,  # ...remaining feature columns elided
) -> float:
    # Each argument arrives as a PyArrow array; collect them in the
    # order the model expects.
    array_list = [carat, depth]
    return model.predict_arrays(array_list)
```

Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
UDF based on predicates it knows about.
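
For instance, in an expression like the following, Ibis can reason about the
filter but not about what happens inside `predict_gbdt` (an illustrative
sketch; `diamonds.parquet` is a hypothetical dataset path):

```python
import ibis

t = ibis.read_parquet("diamonds.parquet")  # hypothetical dataset path

expr = (
    t.mutate(pred=predict_gbdt(t.carat, t.depth))
    .filter(t.carat < 1.0)
)
# The filter prunes rows, but the model inside the UDF still walks
# every tree branch for each surviving row.
```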


## Making Ibis UDFs predicate-aware

The core of the pattern: walk the expression tree, collect filter predicates
that touch the model's feature columns, prune the model under those
predicates, and substitute a UDF bound to the pruned model back into the plan.
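
Below is a minimal sketch of that rewrite. It assumes the quickgrove API shown
earlier and a hypothetical `make_udf` helper that builds a scalar PyArrow UDF
bound to a given model; the operation class names follow recent Ibis versions,
and the complete implementation lives in the example linked below.

```python
import ibis.expr.operations as ops
import quickgrove


def rewrite_with_pruned_model(expr, model, make_udf):
    """Prune `model` under the expression's `column < value` filters and
    substitute a UDF bound to the pruned model. A sketch, not the full rewrite."""
    # 1. Collect `column < value` predicates found anywhere in the expression.
    predicates = [
        quickgrove.Feature(node.left.name) < node.right.value
        for node in expr.op().find(ops.Less)
        if isinstance(node.left, ops.Field) and isinstance(node.right, ops.Literal)
    ]
    if not predicates:
        return expr  # nothing to push down

    # 2. Prune branches the filters make unreachable.
    model.prune(predicates)  # prune API as in the quickgrove example above

    # 3. Swap a UDF bound to the pruned model into the plan
    #    (the node-replacement API varies across Ibis versions).
    udf_nodes = expr.op().find(ops.ScalarUDF)
    if not udf_nodes:
        return expr
    old = udf_nodes[0]
    new = make_udf(model, old.args)  # hypothetical helper
    return expr.op().replace({old: new}).to_expr()
```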
See the diff below:

Notice that with pruning we can drop some of the projections in
the UDF, e.g. `color_i`, `color_j`, and `clarity_vvs2`. The underlying engine
(e.g., DataFusion) may optimize this further when pulling data for UDFs. We
cannot completely drop these from the query expression.

```shell
# (plan diff abridged: with pruning, the rewritten expression no longer
#  projects color_i, color_j, or clarity_vvs2 into the UDF)
```

## Putting it all together

The complete example can be found [here](https://github.com/letsql/trusty/blob/main/python/examples/ibis_filter_condition.py).
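
In outline, the whole flow looks like this (reusing the sketches above; paths
and the `make_udf` helper are hypothetical):

```python
import ibis
import quickgrove

model = quickgrove.json_load("diamonds_model.json")  # loading call assumed

t = ibis.read_parquet("diamonds.parquet")  # hypothetical dataset path
expr = (
    t.mutate(pred=predict_gbdt(t.carat, t.depth))
    .filter(t.color_i < 0.2)
)

# Collect the filters, prune the model, and swap the pruned UDF into the plan.
optimized = rewrite_with_pruned_model(expr, model, make_udf)
optimized.execute()
```
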
When this is done, the model inside `predict_gbdt` will be **pruned** based on
the expression's filter conditions. This can yield significant speedups on
large datasets (see @tbl-perf).

## Performance impact

Benchmark results (@tbl-perf) show that the pruned models
translate to real compute savings, albeit heavily dependent on how pertinent
the filter conditions might be.

## LetSQL: simplifying UDF rewriting

With LetSQL, you get a **shorter, more declarative approach** to the same
optimization logic we manually coded with Ibis. It abstracts away the gritty
parts of rewriting your query plan.

## Best practices & considerations

- **Predicate Types**: Currently, we demonstrated `column < value` logic. You
  can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits; a
  sketch follows this list.
- **Quickgrove** only supports a handful of objective functions and most
notably does not have categorical support yet. In theory, categorical variables
make better candidates for pruning based on filter conditions. It only
supports XGBoost format.
- **Model Format**: XGBoost JSON is straightforward to parse. Other formats
(e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
- **Predicate parsing**: Filter expressions more complex than simple comparisons
  need more robust parsing.
- **Workload fit**: The rewrite pays off when queries repeatedly filter on the
  same columns your trees split on. For purely ad hoc queries or rarely used
  filters, the overhead of rewriting might outweigh the benefit.
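
As a sketch of the first point, predicate extraction can normalize the other
comparison operators on the Ibis side before translating them into pruning
predicates (operation class names per recent Ibis versions; `normalize` is a
hypothetical helper):

```python
import ibis.expr.operations as ops

COMPARISONS = {
    ops.Less: "<",
    ops.LessEqual: "<=",
    ops.Greater: ">",
    ops.GreaterEqual: ">=",
}


def normalize(node):
    """Turn a comparison node into a (column, op, value) triple, or None."""
    op = COMPARISONS.get(type(node))
    if op is None:
        return None
    if isinstance(node.left, ops.Field) and isinstance(node.right, ops.Literal):
        return (node.left.name, op, node.right.value)
    if isinstance(node.left, ops.Literal) and isinstance(node.right, ops.Field):
        # Flip `0.2 > color_i` into `color_i < 0.2`.
        flipped = {"<": ">", "<=": ">=", ">": "<", ">=": "<="}
        return (node.right.name, flipped[op], node.left.value)
    return None
```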

## Conclusion

Combining **Ibis** with a prune-friendly framework like quickgrove lets you
optimize large-scale ML inference inside your SQL workflows. By **pushing filter
predicates down into your decision trees**, you speed up queries significantly.
With LetSQL, you can streamline this entire process—especially if you’re
looking for an out-of-the-box solution that integrates with multiple engines
along with batteries-included features like caching and aggregate/window UDFs.
For the next steps, consider experimenting with more complex models, exploring
additional predicate types, or giving LetSQL a try.

## Further reading

- [End-to-end Optimization of Machine Learning Prediction Queries](https://arxiv.org/pdf/2206.00136)
- [Ibis + Torch Blog Post](https://ibis-project.org/posts/torch/)
- [Multi-Engine Data Stack with
Ibis](https://www.letsql.com/posts/multi-engine-data-stack-ibis/)
- [LetSQL Documentation](https://docs.letsql.com)
