
[Feature] Bucketing using Spark Algorithm #772

Open
pransito opened this issue Dec 17, 2024 · 1 comment

@pransito
Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature

Describe the feature

For Athena tables (not Iceberg), a "bucketing_format" flag should be available to switch between Hive and Spark bucketing (see the sketch below):

https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support

Is this available already?
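
For illustration, a model config with the requested flag could look roughly like this. bucketed_by and bucket_count are existing dbt-athena configs; bucketing_format is the hypothetical new flag, and stg_sales is a made-up upstream model:

-- Sketch only: bucketing_format does not exist in dbt-athena today.
{{
  config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8,
    bucketing_format='spark'  -- hypothetical: 'hive' (current behavior) or 'spark'
  )
}}

select * from {{ ref('stg_sales') }}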

Describe alternatives you've considered

Using the existing bucketing support as is; however, downstream Spark jobs cannot make use of Athena's Hive-format bucketing.

Who will this benefit?

Teams that serve bucketed Athena tables to downstream Spark jobs.

Are you interested in contributing this feature?

Perhaps.

Anything else?

https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support

@nicor88
Contributor

nicor88 commented Dec 17, 2024

For this to work, the CTAS should be run like this:

CREATE EXTERNAL TABLE sales (...) 
... 
CLUSTERED BY (`customer_id`) INTO 8 BUCKETS 
... 
TBLPROPERTIES ( 
  'bucketing_format' = 'spark' 
)
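
Note that this is the CREATE TABLE / CLUSTERED BY syntax from the AWS documentation linked above: the format is selected through the bucketing_format entry in TBLPROPERTIES, so the adapter would have to render both the CLUSTERED BY clause and that table property when the Spark format is requested.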

There are two options to implement this (both sketched below):

  1. Use the current bucketing properties, bucket_count and bucketed_by, and add another flag, enable_spark_bucketing (boolean), that takes care of using CLUSTERED BY and setting the right table properties.
  2. Introduce a clustered_by config property and reuse bucket_count. In the docs we should then make it explicit that clustered_by is meant to be used for Spark-compatible bucketing.
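
For illustration, the two options could surface in a model config roughly as follows. Both enable_spark_bucketing and clustered_by are hypothetical names taken from the list above; neither exists in dbt-athena today:

-- Option 1: keep bucketed_by/bucket_count, add a hypothetical boolean flag.
{{
  config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8,
    enable_spark_bucketing=true
  )
}}

-- Option 2: a hypothetical clustered_by property, reusing bucket_count.
{{
  config(
    materialized='table',
    clustered_by=['customer_id'],
    bucket_count=8
  )
}}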

Since dbt Labs is now the repo maintainer, it's probably better for someone from dbt Labs to indicate the preferred option.
