For this to work, CTAS statements need to produce DDL like this:
```sql
CREATE EXTERNAL TABLE sales (...)
...
CLUSTERED BY (`customer_id`) INTO 8 BUCKETS
...
TBLPROPERTIES (
    'bucketing_format' = 'spark'
)
```
There are two options to implement this (both sketched below):

1. Use the current bucketing properties, `bucket_count` and `bucketed_by`, and add a new boolean flag, `enable_spark_bucketing`, that takes care of emitting `CLUSTERED BY` and setting the right table properties.
2. Introduce a `clustered_by` config property and reuse `bucket_count`. In that case, the docs should make explicit that `clustered_by` is meant to be used for Spark-compatible bucketing.
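A rough sketch of what each option could look like in a model config; note that `enable_spark_bucketing` and `clustered_by` are the proposed names from the list above, not existing adapter flags, while `bucketed_by` and `bucket_count` are the adapter's current properties:

```sql
-- Option 1: keep the existing properties, add a proposed boolean flag
{{ config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8,
    enable_spark_bucketing=true
) }}

-- Option 2: proposed clustered_by property, reusing bucket_count
{{ config(
    materialized='table',
    clustered_by=['customer_id'],
    bucket_count=8
) }}
```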
Since dbt Labs is now the repo maintainer, it is probably best if someone from dbt Labs indicates the preferred option.
Is this your first time submitting a feature request?
Describe the feature
For Athena tables (not Iceberg), there should be a `bucketing_format` flag available to switch between Hive and Spark bucketing.
https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support
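For contrast with the Spark-format DDL shown in the comment above, this is roughly what Athena's existing Hive-format bucketing looks like in a CTAS today (the table name, column, and S3 path are illustrative):

```sql
-- Hive-format bucketing: Athena's current default for CTAS
CREATE TABLE sales_hive
WITH (
    external_location = 's3://my-bucket/sales_hive/',  -- illustrative path
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 8
) AS
SELECT * FROM source_sales;
```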
Is this available already?
Describe alternatives you've considered
Using bucketing as is; however, downstream Spark jobs cannot consume Athena's Hive-format bucketing.
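For reference, "bucketing as is" means the adapter's current config, which today yields Hive-format buckets (column name, bucket count, and the upstream model are illustrative):

```sql
{{ config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8
) }}

select * from {{ ref('raw_sales') }}  -- illustrative upstream model
```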
Who will this benefit?
Teams that serve bucketed Athena tables to downstream Spark jobs.
Are you interested in contributing this feature?
Perhaps.
Anything else?
https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support