
## Introduction

In an ideal world, deploying machine learning models within SQL queries would
be as simple as calling a built-in function. Unfortunately, many ML predictions
live inside **User-Defined Functions (UDFs)** that traditional SQL planners
can't modify, preventing optimizations like predicate pushdowns.

This blog post will showcase how you can **prune decision tree models based on
query filters** by dynamically rewriting your expression using **Ibis** and
**quickgrove**, an experimental
[GBDT](https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt)
inference library built in Rust. We'll also show how
[LetSQL](https://github.com/letsql/letsql) can simplify this pattern further
and integrate seamlessly into your ML workflows.

---

## ML models meet SQL

Because the UDF is a black box to the planner, all of the model's computations,
including unreachable tree paths, are evaluated for every row. With tree-based models, entire
branches might never be evaluated at all — so the ideal scenario is to prune
those unnecessary branches *before* evaluating them.

### Why it matters

- **Flexible**: Ibis makes its IR transparent and easy to manipulate and rewrite.
- **Simple**: Add advanced techniques (e.g., predicate pushdowns) to your pipeline
  without having to dive into database internals.
- **Performant**: For large datasets (hundreds of millions of rows or more),
  these optimizations add up quickly.

---

## Smart UDFs with Ibis

**Ibis** is known for letting you write engine-agnostic deferred expressions in
Python without losing the power of underlying engines like Spark, DuckDB, or
BigQuery. Meanwhile, quickgrove provides a mechanism to prune Gradient Boosted
Decision Tree (GBDT) models based on known filter conditions.
Putting the two together, we can:

1. **Prune decision trees** by removing branches that can never be reached,
given the known filters
2. **Rewrite expressions** with the pruned model into the query plan to skip
unnecessary computations

### Understanding tree pruning
This approach is inspired by the [end-to-end prediction query
optimizer](https://arxiv.org/pdf/2206.00136) paper. It demonstrates how you can
prune nodes in query plans for tree-based inference, so we’re taking a similar
approach here for **forests** (GBDTs) using **Ibis**.
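
To make this concrete, here is a toy sketch of branch pruning on a hand-rolled
regression tree. It is illustrative only (quickgrove implements this over real
GBDT forests): given a filter like `color_i < 0.2`, any branch that requires
`color_i >= threshold` with `threshold >= 0.2` can never be taken, so we
replace the split with its reachable subtree.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A toy regression-tree node: internal nodes split on `feature < threshold`."""
    feature: str | None = None
    threshold: float | None = None
    left: "Node | None" = None    # taken when feature < threshold
    right: "Node | None" = None   # taken when feature >= threshold
    value: float | None = None    # set on leaves


def prune(node: Node, feature: str, upper_bound: float) -> Node:
    """Remove branches unreachable under the filter `feature < upper_bound`."""
    if node.value is not None:  # leaf: nothing to prune
        return node
    if node.feature == feature and upper_bound <= node.threshold:
        # Every surviving row has feature < upper_bound <= threshold,
        # so the right branch can never be taken: keep only the left.
        return prune(node.left, feature, upper_bound)
    return Node(
        feature=node.feature,
        threshold=node.threshold,
        left=prune(node.left, feature, upper_bound),
        right=prune(node.right, feature, upper_bound),
    )
```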


### Quickgrove: prunable GBDT models

Quickgrove is an experimental package that can load GBDT JSON models and
provides a `.prune(...)` API to remove unreachable branches. For example:
```python
import quickgrove

model = quickgrove.json_load("model.json")  # loading call assumed; see the quickgrove docs
model.prune([quickgrove.Feature("color_i") < 0.2])  # Prune based on known predicate
```
Once pruned, the model is leaner to evaluate. Note: The results heavily depend on
model splits and interactions with predicate pushdowns.


## Scalar PyArrow UDFs in Ibis

Ibis lets you register scalar UDFs that operate on PyArrow arrays. A minimal
sketch of the predict UDF (the real signature takes one argument per feature
column, and `model` is the quickgrove model loaded above):

```python
import ibis


@ibis.udf.scalar.pyarrow
def predict_gbdt(
    carat: float,
    depth: float,  # ...remaining feature columns elided
) -> float:
    # Each argument arrives as a PyArrow array; collect them in the
    # order the model expects.
    array_list = [carat, depth]
    return model.predict_arrays(array_list)
```

Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
UDF based on predicates it knows about.
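
For instance, in an expression like the following, Ibis can reason about the
filter but not about what happens inside `predict_gbdt` (an illustrative
sketch; `diamonds.parquet` is a hypothetical dataset path):

```python
import ibis

t = ibis.read_parquet("diamonds.parquet")  # hypothetical dataset path

expr = (
    t.mutate(pred=predict_gbdt(t.carat, t.depth))
    .filter(t.carat < 1.0)
)
# The filter prunes rows, but the model inside the UDF still walks
# every tree branch for each surviving row.
```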


## Making Ibis UDFs predicate-aware

The core of the pattern: walk the expression tree, collect filter predicates
that touch the model's feature columns, prune the model under those
predicates, and substitute a UDF bound to the pruned model back into the plan.
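
Below is a minimal sketch of that rewrite. It assumes the quickgrove API shown
earlier and a hypothetical `make_udf` helper that builds a scalar PyArrow UDF
bound to a given model; the operation class names follow recent Ibis versions,
and the complete implementation lives in the example linked below.

```python
import ibis.expr.operations as ops
import quickgrove


def rewrite_with_pruned_model(expr, model, make_udf):
    """Prune `model` under the expression's `column < value` filters and
    substitute a UDF bound to the pruned model. A sketch, not the full rewrite."""
    # 1. Collect `column < value` predicates found anywhere in the expression.
    predicates = [
        quickgrove.Feature(node.left.name) < node.right.value
        for node in expr.op().find(ops.Less)
        if isinstance(node.left, ops.Field) and isinstance(node.right, ops.Literal)
    ]
    if not predicates:
        return expr  # nothing to push down

    # 2. Prune branches the filters make unreachable.
    model.prune(predicates)  # prune API as in the quickgrove example above

    # 3. Swap a UDF bound to the pruned model into the plan
    #    (the node-replacement API varies across Ibis versions).
    udf_nodes = expr.op().find(ops.ScalarUDF)
    if not udf_nodes:
        return expr
    old = udf_nodes[0]
    new = make_udf(model, old.args)  # hypothetical helper
    return expr.op().replace({old: new}).to_expr()
```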
See the diff below:

Notice that with pruning we can drop some of the projections in
the UDF, e.g. `color_i`, `color_j`, and `clarity_vvs2`. The underlying engine
(e.g., DataFusion) may optimize this further when pulling data for UDFs. We
cannot completely drop these from the query expression.

```shell
# (plan diff abridged: with pruning, the rewritten expression no longer
#  projects color_i, color_j, or clarity_vvs2 into the UDF)
```

## Putting it all together

The complete example can be found [here](https://github.com/letsql/trusty/blob/main/python/examples/ibis_filter_condition.py).
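
In outline, the whole flow looks like this (reusing the sketches above; paths
and the `make_udf` helper are hypothetical):

```python
import ibis
import quickgrove

model = quickgrove.json_load("diamonds_model.json")  # loading call assumed

t = ibis.read_parquet("diamonds.parquet")  # hypothetical dataset path
expr = (
    t.mutate(pred=predict_gbdt(t.carat, t.depth))
    .filter(t.color_i < 0.2)
)

# Collect the filters, prune the model, and swap the pruned UDF into the plan.
optimized = rewrite_with_pruned_model(expr, model, make_udf)
optimized.execute()
```
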
When this is done, the model inside `predict_gbdt` will be **pruned** based on
the expression's filter conditions. This can yield significant speedups on
large datasets (see @tbl-perf).

## Performance impact

Benchmark results (@tbl-perf) show that the pruned models
translate to real compute savings, albeit heavily dependent on how pertinent
the filter conditions might be.

## LetSQL: simplifying UDF rewriting

With LetSQL, you get a **shorter, more declarative approach** to the same
optimization logic we manually coded with Ibis. It abstracts away the gritty
parts of rewriting your query plan.

## Best practices & considerations

- **Predicate Types**: Currently, we demonstrated `column < value` logic. You
  can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits; a
  sketch follows this list.
- **Quickgrove** only supports a handful of objective functions and most
notably does not have categorical support yet. In theory, categorical variables
make better candidates for pruning based on filter conditions. It only
supports XGBoost format.
- **Model Format**: XGBoost JSON is straightforward to parse. Other formats
(e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
- **Predicate parsing**: Filter expressions more complex than simple comparisons
  need more robust parsing.
- **Workload fit**: The rewrite pays off when queries repeatedly filter on the
  same columns your trees split on. For purely ad hoc queries or rarely used
  filters, the overhead of rewriting might outweigh the benefit.
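
As a sketch of the first point, predicate extraction can normalize the other
comparison operators on the Ibis side before translating them into pruning
predicates (operation class names per recent Ibis versions; `normalize` is a
hypothetical helper):

```python
import ibis.expr.operations as ops

COMPARISONS = {
    ops.Less: "<",
    ops.LessEqual: "<=",
    ops.Greater: ">",
    ops.GreaterEqual: ">=",
}


def normalize(node):
    """Turn a comparison node into a (column, op, value) triple, or None."""
    op = COMPARISONS.get(type(node))
    if op is None:
        return None
    if isinstance(node.left, ops.Field) and isinstance(node.right, ops.Literal):
        return (node.left.name, op, node.right.value)
    if isinstance(node.left, ops.Literal) and isinstance(node.right, ops.Field):
        # Flip `0.2 > color_i` into `color_i < 0.2`.
        flipped = {"<": ">", "<=": ">=", ">": "<", ">=": "<="}
        return (node.right.name, flipped[op], node.left.value)
    return None
```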

## Conclusion

Combining **Ibis** with a prune-friendly framework like quickgrove lets you
optimize large-scale ML inference inside your SQL workflows. By **pushing filter
predicates down into your decision trees**, you speed up queries significantly.
With LetSQL, you can streamline this entire process—especially if you’re
looking for an out-of-the-box solution that integrates with multiple engines
along with batteries-included features like caching and aggregate/window UDFs.
For the next steps, consider experimenting with more complex models, exploring
additional predicate types, or giving LetSQL a try.

## Further reading

- [End-to-end Optimization of Machine Learning Prediction Queries](https://arxiv.org/pdf/2206.00136)
- [Ibis + Torch Blog Post](https://ibis-project.org/posts/torch/)
- [Multi-Engine Data Stack with
Ibis](https://www.letsql.com/posts/multi-engine-data-stack-ibis/)
- [LetSQL Documentation](https://docs.letsql.com)
