
[Feature] Bucketing using Spark Algorithm #772

Open
pransito opened this issue Dec 17, 2024 · 1 comment

@pransito
Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature

Describe the feature

For Athena tables (not Iceberg), a "bucketing_format" flag should be available to switch between Hive and Spark bucketing (see the sketch below):

https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support

Is this available already?
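
For illustration, a model config with the requested flag could look roughly like this. bucketed_by and bucket_count are existing dbt-athena configs; bucketing_format is the hypothetical new flag, and stg_sales is a made-up upstream model:

-- Sketch only: bucketing_format does not exist in dbt-athena today.
{{
  config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8,
    bucketing_format='spark'  -- hypothetical: 'hive' (current behavior) or 'spark'
  )
}}

select * from {{ ref('stg_sales') }}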

Describe alternatives you've considered

Using the existing bucketing support as is; however, downstream Spark jobs cannot make use of Athena's Hive-format bucketing.

Who will this benefit?

Teams that serve bucketed Athena tables to downstream Spark jobs.

Are you interested in contributing this feature?

Perhaps.

Anything else?

https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing-what-is-bucketing.html#ctas-partitioning-and-bucketing-hive-and-spark-support

@nicor88
Contributor

nicor88 commented Dec 17, 2024

For this to work, the CTAS should be run like this:

CREATE EXTERNAL TABLE sales (...) 
... 
CLUSTERED BY (`customer_id`) INTO 8 BUCKETS 
... 
TBLPROPERTIES ( 
  'bucketing_format' = 'spark' 
)
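
Note that this is the CREATE TABLE / CLUSTERED BY syntax from the AWS documentation linked above: the format is selected through the bucketing_format entry in TBLPROPERTIES, so the adapter would have to render both the CLUSTERED BY clause and that table property when the Spark format is requested.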

There are two options to implement this (both sketched below):

  1. Use the current bucketing properties, bucket_count and bucketed_by, and add another flag, enable_spark_bucketing (boolean), that takes care of using CLUSTERED BY and setting the right table properties.
  2. Introduce a clustered_by config property and reuse bucket_count. In the docs we should then make it explicit that clustered_by is meant to be used for Spark-compatible bucketing.
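
For illustration, the two options could surface in a model config roughly as follows. Both enable_spark_bucketing and clustered_by are hypothetical names taken from the list above; neither exists in dbt-athena today:

-- Option 1: keep bucketed_by/bucket_count, add a hypothetical boolean flag.
{{
  config(
    materialized='table',
    bucketed_by=['customer_id'],
    bucket_count=8,
    enable_spark_bucketing=true
  )
}}

-- Option 2: a hypothetical clustered_by property, reusing bucket_count.
{{
  config(
    materialized='table',
    clustered_by=['customer_id'],
    bucket_count=8
  )
}}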

Since dbt Labs is now the repo maintainer, it's probably better for someone from dbt Labs to indicate the preferred option.
