
Commit b5446ef

Upgrade datafusion 39 (#728)
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1

  `datafusion-common` also depends on `pyo3`, so they need to be upgraded together.

* feat: remove GetIndexField

  datafusion replaced Expr::GetIndexField with a FieldAccessor trait.

  Ref apache/datafusion#10568
  Ref apache/datafusion#10769

* feat: update ScalarFunction

  The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.

  Ref apache/datafusion#10325

* feat: incorporate upstream array_slice fixes

  Fixes #670

* update ExecutionPlan::children impl for DatasetExec

  Ref apache/datafusion#10543

* update value_interval_daytime

  Ref apache/arrow-rs#5769

* update regexp_replace and regexp_match

  Fixes #677

* add gil-refs feature to pyo3

  This silences pyo3's deprecation warnings for its new Bound API. It's the first step of the migration, and should be removed before merge.

  Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021

* fix signature for octet_length

  Ref apache/datafusion#10726

* update signature for covar_samp

  AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by.

  Ref apache/datafusion#10545
  Ref apache/datafusion#10492

* convert covar_pop to expr_fn api

  Ref: https://github.com/apache/datafusion/pull/10418/files

* convert median to expr_fn api

  Ref apache/datafusion#10644

* convert variance sample to UDF

  Ref apache/datafusion#10667

* convert first_value and last_value to UDFs

  Ref apache/datafusion#10648

* checkpointing with a few todos to fix remaining compile errors

* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano

* convert sum aggregate function to UDF

* remove unnecessary clone on double reference

* apply cargo fmt

* remove duplicate allow-dead-code annotation

* update tpch examples for new pyarrow interval

  Fixes #665

* marked q11 tpch example as expected fail

  Ref #730

* add default stride of None back to array_slice
1 parent 860283a commit b5446ef

23 files changed: +467 -327 lines changed

Cargo.lock

Lines changed: 321 additions & 219 deletions

Cargo.toml

Lines changed: 11 additions & 11 deletions
@@ -17,7 +17,7 @@
 
 [package]
 name = "datafusion-python"
-version = "38.0.1"
+version = "39.0.0"
 homepage = "https://datafusion.apache.org/python"
 repository = "https://github.com/apache/datafusion-python"
 authors = ["Apache DataFusion <[email protected]>"]
@@ -36,28 +36,28 @@ substrait = ["dep:datafusion-substrait"]
 [dependencies]
 tokio = { version = "1.35", features = ["macros", "rt", "rt-multi-thread", "sync"] }
 rand = "0.8"
-pyo3 = { version = "0.20", features = ["extension-module", "abi3", "abi3-py38"] }
-datafusion = { version = "38.0.0", features = ["pyarrow", "avro", "unicode_expressions"] }
-datafusion-common = { version = "38.0.0", features = ["pyarrow"] }
-datafusion-expr = "38.0.0"
-datafusion-functions-array = "38.0.0"
-datafusion-optimizer = "38.0.0"
-datafusion-sql = "38.0.0"
-datafusion-substrait = { version = "38.0.0", optional = true }
+pyo3 = { version = "0.21", features = ["extension-module", "abi3", "abi3-py38", "gil-refs"] }
+datafusion = { version = "39.0.0", features = ["pyarrow", "avro", "unicode_expressions"] }
+datafusion-common = { version = "39.0.0", features = ["pyarrow"] }
+datafusion-expr = "39.0.0"
+datafusion-functions-array = "39.0.0"
+datafusion-optimizer = "39.0.0"
+datafusion-sql = "39.0.0"
+datafusion-substrait = { version = "39.0.0", optional = true }
 prost = "0.12"
 prost-types = "0.12"
 uuid = { version = "1.8", features = ["v4"] }
 mimalloc = { version = "0.1", optional = true, default-features = false, features = ["local_dynamic_tls"] }
 async-trait = "0.1"
 futures = "0.3"
-object_store = { version = "0.9.1", features = ["aws", "gcp", "azure"] }
+object_store = { version = "0.10.1", features = ["aws", "gcp", "azure"] }
 parking_lot = "0.12"
 regex-syntax = "0.8.1"
 syn = "2.0.43"
 url = "2.2"
 
 [build-dependencies]
-pyo3-build-config = "0.20.0"
+pyo3-build-config = "0.21"
 
 [lib]
 name = "datafusion_python"

docs/source/user-guide/common-operations/functions.rst

Lines changed: 2 additions & 1 deletion
@@ -92,12 +92,13 @@ DataFusion offers a range of helpful options.
         f.left(col('"Name"'), literal(4)).alias("code")
     )
 
-This also includes the functions for regular expressions like :func:`.regexp_match`
+This also includes the functions for regular expressions like :func:`.regexp_replace` and :func:`.regexp_match`
 
 .. ipython:: python
 
     df.select(
         f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
+        f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
     )
 
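As a rough illustration of what the added `regexp_replace` call computes for each row, the sketch below approximates it with Python's standard `re` module. It is not the DataFusion implementation. The sample names are assumed values standing in for the `"Name"` column, the helper `regexp_replace_first` is hypothetical, and `count=1` encodes the assumption that `regexp_replace` without a flags argument replaces only the first match.

```python
import re

# Hypothetical sample values standing in for the "Name" column.
names = ["Bulbasaur", "Ivysaur", "Charmander"]


def regexp_replace_first(value: str, pattern: str, replacement: str) -> str:
    """Replace the first match of `pattern` in `value` (stdlib approximation)."""
    return re.sub(pattern, replacement, value, count=1)


# Mirrors the pattern and replacement literals used in the doc example.
flowers = [regexp_replace_first(name, "saur", "fleur") for name in names]
print(flowers)  # ['Bulbafleur', 'Ivyfleur', 'Charmander']
```

Rows that do not match the pattern pass through unchanged, which is why the last value is left as it was.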

examples/tpch/_tests.py

Lines changed: 4 additions & 1 deletion
@@ -72,7 +72,10 @@ def check_q17(df):
         ("q08_market_share", "q8"),
         ("q09_product_type_profit_measure", "q9"),
         ("q10_returned_item_reporting", "q10"),
-        ("q11_important_stock_identification", "q11"),
+        pytest.param(
+            "q11_important_stock_identification", "q11",
+            marks=pytest.mark.xfail # https://github.com/apache/datafusion-python/issues/730
+        ),
         ("q12_ship_mode_order_priority", "q12"),
         ("q13_customer_distribution", "q13"),
         ("q14_promotion_effect", "q14"),
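The change above swaps a plain tuple for `pytest.param(..., marks=pytest.mark.xfail)` so that one case is expected to fail while the rest of the parametrized list is unaffected. A minimal standalone sketch of that pattern follows. The shortened case list, the parameter names `query_code` and `answer_file`, and the placeholder test body are all assumptions for illustration and are not taken from `_tests.py`.

```python
import pytest

# Hypothetical, shortened case list; the real list in _tests.py is longer.
CASES = [
    ("q10_returned_item_reporting", "q10"),
    pytest.param(
        "q11_important_stock_identification",
        "q11",
        # Expected failure, tracked in datafusion-python issue #730.
        marks=pytest.mark.xfail(reason="see issue #730"),
    ),
    ("q12_ship_mode_order_priority", "q12"),
]


@pytest.mark.parametrize(("query_code", "answer_file"), CASES)
def test_case_names(query_code, answer_file):
    # Placeholder body: the real test runs the query and compares results.
    assert answer_file.startswith("q")
```

Plain tuples and `pytest.param` entries can be mixed in one list, so marking a single case does not require restructuring the others.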

examples/tpch/q01_pricing_summary_report.py

Lines changed: 1 addition & 3 deletions
@@ -48,9 +48,7 @@
 # want to report results for. It should be between 60-120 days before the end.
 DAYS_BEFORE_FINAL = 90
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval = pa.scalar((0, 0, DAYS_BEFORE_FINAL), type=pa.month_day_nano_interval())
+interval = pa.scalar((0, DAYS_BEFORE_FINAL, 0), type=pa.month_day_nano_interval())
 
 print("Final date in database:", greatest_ship_date)
 
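The TPC-H example changes in this commit all move the day count from the third tuple slot to the second. The sketch below uses only the standard library to show why the slot matters, assuming the tuple order for a month-day-nano interval is (months, days, nanoseconds). The `MonthDayNano` stand-in, the `subtract_interval` helper, and the example date are hypothetical and are not part of the TPC-H scripts or of pyarrow.

```python
from datetime import date, timedelta
from typing import NamedTuple


class MonthDayNano(NamedTuple):
    """Stand-in mirroring the assumed (months, days, nanoseconds) order."""

    months: int
    days: int
    nanoseconds: int


DAYS_BEFORE_FINAL = 90


def subtract_interval(d: date, iv: MonthDayNano) -> date:
    """Subtract the day and nanosecond parts of `iv` from a date.

    Month arithmetic is out of scope for this sketch, so it is rejected.
    """
    if iv.months != 0:
        raise ValueError("month arithmetic is not handled in this sketch")
    # timedelta has microsecond resolution, so nanoseconds are truncated.
    return d - timedelta(days=iv.days, microseconds=iv.nanoseconds // 1000)


final = date(1998, 12, 1)  # assumed example date
new_style = MonthDayNano(0, DAYS_BEFORE_FINAL, 0)  # 90 days
old_style = MonthDayNano(0, 0, DAYS_BEFORE_FINAL)  # 90 nanoseconds

print(subtract_interval(final, new_style))  # 1998-09-02
print(subtract_interval(final, old_style))  # 1998-12-01
```

Read literally, the old tuple describes 90 nanoseconds rather than 90 days, which fits the removed comment describing it as a hack pending issue #665.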

examples/tpch/q04_order_priority_checking.py

Lines changed: 1 addition & 3 deletions
@@ -49,9 +49,7 @@
 # Create a date object from the string
 date = datetime.strptime(DATE_OF_INTEREST, "%Y-%m-%d").date()
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval = pa.scalar((0, 0, INTERVAL_DAYS), type=pa.month_day_nano_interval())
+interval = pa.scalar((0, INTERVAL_DAYS, 0), type=pa.month_day_nano_interval())
 
 # Limit results to cases where commitment date before receipt date
 # Aggregate the results so we only get one row to join with the order table.

examples/tpch/q05_local_supplier_volume.py

Lines changed: 1 addition & 3 deletions
@@ -41,9 +41,7 @@
 
 date = datetime.strptime(DATE_OF_INTEREST, "%Y-%m-%d").date()
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval = pa.scalar((0, 0, INTERVAL_DAYS), type=pa.month_day_nano_interval())
+interval = pa.scalar((0, INTERVAL_DAYS, 0), type=pa.month_day_nano_interval())
 
 # Load the dataframes we need
 

examples/tpch/q06_forecasting_revenue_change.py

Lines changed: 1 addition & 3 deletions
@@ -45,9 +45,7 @@
 
 date = datetime.strptime(DATE_OF_INTEREST, "%Y-%m-%d").date()
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval = pa.scalar((0, 0, INTERVAL_DAYS), type=pa.month_day_nano_interval())
+interval = pa.scalar((0, INTERVAL_DAYS, 0), type=pa.month_day_nano_interval())
 
 # Load the dataframes we need
 

examples/tpch/q10_returned_item_reporting.py

Lines changed: 1 addition & 3 deletions
@@ -38,9 +38,7 @@
 
 date_start_of_quarter = lit(datetime.strptime(DATE_START_OF_QUARTER, "%Y-%m-%d").date())
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval_one_quarter = lit(pa.scalar((0, 0, 92), type=pa.month_day_nano_interval()))
+interval_one_quarter = lit(pa.scalar((0, 92, 0), type=pa.month_day_nano_interval()))
 
 # Load the dataframes we need
 

examples/tpch/q12_ship_mode_order_priority.py

Lines changed: 1 addition & 3 deletions
@@ -51,9 +51,7 @@
 
 date = datetime.strptime(DATE_OF_INTEREST, "%Y-%m-%d").date()
 
-# Note: this is a hack on setting the values. It should be set differently once
-# https://github.com/apache/datafusion-python/issues/665 is resolved.
-interval = pa.scalar((0, 0, 365), type=pa.month_day_nano_interval())
+interval = pa.scalar((0, 365, 0), type=pa.month_day_nano_interval())
 
 
 df = df_lineitem.filter(col("l_receiptdate") >= lit(date)).filter(
