# S3Table with OSS Spark on EKS Guide

This guide provides step-by-step instructions for setting up and running a Spark job on Amazon EKS using S3Table for data storage.

## Prerequisites

- Latest version of AWS CLI installed (must include S3Tables API support)
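
To confirm your installation is recent enough, you can run a quick sanity check; it only prints the CLI version and the available S3Tables subcommands, and it fails on older CLI versions that do not yet include the `s3tables` service.

```sh
# Print the installed AWS CLI version
aws --version

# List the S3Tables subcommands; this errors if the CLI
# does not yet include the s3tables service
aws s3tables help
```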

## Step 1: Deploy Spark Cluster on EKS

Follow the steps below to deploy a Spark cluster on EKS:

[Spark Operator on EKS with YuniKorn Scheduler](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-yunikorn#prerequisites)

Once your cluster is up and running, proceed with the following steps to execute a sample Spark job using S3Tables.
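
Before moving on, it can help to verify that the cluster is reachable and the Spark Operator is running. A minimal check is sketched below; the `spark-operator` namespace is an assumption based on the blueprint defaults, so adjust it if your deployment uses a different one.

```sh
# Confirm kubectl is pointed at the EKS cluster and the nodes are Ready
kubectl get nodes

# Confirm the Spark Operator pods are running
# (namespace is assumed from the blueprint defaults)
kubectl get pods -n spark-operator
```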

## Step 2: Create Test Data for the Job

Navigate to the example directory and generate the sample data:

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
```

This will create a file called `employee_data.csv` locally with 100 records. Modify the script to adjust the number of records as needed.
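
You can quickly inspect the generated file before uploading it, for example:

```sh
# Preview the first few generated lines
head -n 5 employee_data.csv

# Count the lines in the file
wc -l employee_data.csv
```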

## Step 3: Upload Test Input Data to your S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint and run the command below.

```sh
aws s3 cp employee_data.csv s3://<S3_BUCKET>/s3table-example/input/
```

## Step 4: Upload PySpark Script to S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint and run the command below to upload the sample Spark job to the S3 bucket.

```sh
aws s3 cp s3table-iceberg-pyspark.py s3://<S3_BUCKET>/s3table-example/scripts/
```
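
To confirm both uploads landed where the Spark job expects them, you can list the prefix (replace `<S3_BUCKET>` as before):

```sh
# List everything under the s3table-example prefix;
# you should see the input CSV and the PySpark script
aws s3 ls s3://<S3_BUCKET>/s3table-example/ --recursive
```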

## Step 5: Create S3Table

Replace `<REGION>` with your AWS Region and `<S3TABLE_BUCKET_NAME>` with a name for the new table bucket, then run:

```sh
aws s3tables create-table-bucket \
    --region "<REGION>" \
    --name "<S3TABLE_BUCKET_NAME>"
```

Make a note of the table bucket ARN returned by this command; you will need it in the following steps.
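
If you prefer to capture the ARN in a shell variable for the later steps (the clean-up commands reference it as `${S3TABLE_ARN}`), a sketch is shown below; it assumes the `list-table-buckets` response exposes each bucket's `name` and `arn` fields.

```sh
# Look up the ARN of the table bucket you just created and keep it
# in an environment variable for the following steps
# (field names are assumed from the list-table-buckets response shape)
export S3TABLE_ARN=$(aws s3tables list-table-buckets \
  --region "<REGION>" \
  --query "tableBuckets[?name=='<S3TABLE_BUCKET_NAME>'].arn" \
  --output text)

echo "${S3TABLE_ARN}"
```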

## Step 6: Update Spark Operator YAML File

- Open the `s3table-spark-operator.yaml` file in your preferred text editor.
- Replace `<S3_BUCKET>` with the S3 bucket created by this blueprint (check the Terraform outputs); this is the bucket where you copied the test data and the sample Spark job in the steps above.
- Replace `<S3TABLE_ARN>` with your S3 Table ARN (see the example after this list).
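
If you want to make those replacements from the command line instead of an editor, a `sed` sketch is shown below; the bucket name and ARN are hypothetical placeholders, so substitute your own values.

```sh
# Substitute the placeholders in place (a .bak backup of the original is kept).
# The bucket name and ARN below are examples only -- use your own values.
sed -i.bak \
  -e 's|<S3_BUCKET>|my-doeks-spark-bucket|g' \
  -e 's|<S3TABLE_ARN>|arn:aws:s3tables:us-west-2:123456789012:bucket/doeks-spark-on-eks-s3table|g' \
  s3table-spark-operator.yaml
```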

## Step 7: Execute Spark Job

Apply the updated YAML file to your Kubernetes cluster to submit the Spark job.

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
kubectl apply -f s3table-spark-operator.yaml
```
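
The operator will create a driver pod (and then executors) for the submitted application. You can watch its progress with standard `kubectl` commands; `spark-team-a` is the namespace used by this example.

```sh
# Check the state of the submitted SparkApplication custom resource
kubectl get sparkapplications -n spark-team-a

# Watch the driver and executor pods come up
kubectl get pods -n spark-team-a -w
```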

## Step 8: Verify the Spark Driver Log for the Output

Check the Spark driver logs to verify job progress and output:

```sh
kubectl logs <spark-driver-pod-name> -n spark-team-a
```
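
If you are unsure of the driver pod name, you can look it up by label; the `spark-role=driver` label is an assumption based on the labels Spark on Kubernetes normally applies to driver pods.

```sh
# Find the driver pod for the job (label is assumed from Spark on Kubernetes defaults)
kubectl get pods -n spark-team-a -l spark-role=driver

# Stream the driver logs instead of printing a one-off snapshot
kubectl logs -f <spark-driver-pod-name> -n spark-team-a
```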

## Step 9: Verify the S3Table using S3Table API

Use the S3Table API to confirm the table was created successfully. Replace `<ACCOUNT_ID>` with your AWS account ID and run the command. If you used a different table bucket name or Region in Step 5, adjust the ARN accordingly.

```sh
aws s3tables get-table --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

The output should look similar to the following:

```json
{
    "name": "employee_s3_table",
    "type": "customer",
    "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
    "namespace": [
        "doeks_namespace"
    ],
    "versionToken": "aafc39ddd462690d2a0c",
    "metadataLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3/metadata/00004-62cc4be3-59b5-4647-a78d-1cdf69ec5ed8.metadata.json",
    "warehouseLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3",
    "createdAt": "2025-01-07T22:14:48.689581+00:00",
    "createdBy": "<ACCOUNT_ID>",
    "modifiedAt": "2025-01-09T00:06:09.222917+00:00",
    "ownerAccountId": "<ACCOUNT_ID>",
    "format": "ICEBERG"
}
```

Monitor the table maintenance job status:

```sh
aws s3tables get-table-maintenance-job-status --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

This command provides information about Iceberg compaction, snapshot management, and unreferenced file removal processes.

```json
{
    "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
    "status": {
        "icebergCompaction": {
            "status": "Successful",
            "lastRunTimestamp": "2025-01-08T01:18:08.857000+00:00"
        },
        "icebergSnapshotManagement": {
            "status": "Successful",
            "lastRunTimestamp": "2025-01-08T22:17:08.811000+00:00"
        },
        "icebergUnreferencedFileRemoval": {
            "status": "Successful",
            "lastRunTimestamp": "2025-01-08T22:17:10.377000+00:00"
        }
    }
}
```
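
You can also enumerate what the job created in the table bucket with the list commands, using the same bucket ARN as above:

```sh
# List the namespaces in the table bucket
aws s3tables list-namespaces \
  --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table

# List the tables within the doeks_namespace namespace
aws s3tables list-tables \
  --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table \
  --namespace doeks_namespace
```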

## Step 10: Clean up
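
The commands below reference the table bucket ARN through the `${S3TABLE_ARN}` environment variable. If it is not already set from Step 5, export it first; the values here are placeholders, so substitute your Region, account ID, and table bucket name.

```bash
# Placeholder values -- replace with your Region, account ID, and table bucket name
export S3TABLE_ARN="arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/<S3TABLE_BUCKET_NAME>"
```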

Delete the table.

```bash
aws s3tables delete-table \
  --namespace doeks_namespace \
  --table-bucket-arn ${S3TABLE_ARN} \
  --name employee_s3_table
```

Delete the namespace.

```bash
aws s3tables delete-namespace \
  --namespace doeks_namespace \
  --table-bucket-arn ${S3TABLE_ARN}
```

Finally, delete the table bucket.

```bash
aws s3tables delete-table-bucket \
  --region "<REGION>" \
  --table-bucket-arn ${S3TABLE_ARN}
```

## Conclusion

You have successfully set up and run a Spark job on Amazon EKS using S3Table for data storage. This setup provides a scalable and efficient way to process large datasets using Spark on Kubernetes with the added benefits of S3Table's data management capabilities.

For more advanced usage, refer to the official AWS documentation on S3Table and Spark on EKS.