move documentation to website
Signed-off-by: Manabu McCloskey <[email protected]>
nabuskey committed Jan 14, 2025
1 parent b0910d2 commit cf78d2d
Showing 8 changed files with 69 additions and 213 deletions.
2 changes: 1 addition & 1 deletion analytics/terraform/spark-k8s-operator/addons.tf
@@ -430,7 +430,7 @@ module "eks_data_addons" {
enable_jupyterhub = var.enable_jupyterhub
jupyterhub_helm_config = {
values = [templatefile("${path.module}/helm-values/jupyterhub-singleuser-values.yaml", {
jupyter_single_user_sa_name = kubernetes_service_account_v1.jupyterhub_single_user_sa.metadata[0].name
jupyter_single_user_sa_name = "${var.enable_jupyterhub ? kubernetes_service_account_v1.jupyterhub_single_user_sa[0].metadata[0].name : "no-tused"}"
})]
version = "3.3.8"
}
207 changes: 2 additions & 205 deletions analytics/terraform/spark-k8s-operator/examples/s3-tables/README.md
@@ -1,206 +1,3 @@
# S3Table with OSS Spark on EKS Guide
# S3Table with OSS Spark on EKS

This guide provides step-by-step instructions for setting up and running a Spark job on Amazon EKS using S3Table for data storage.

## Prerequisites

- Latest version of the AWS CLI installed (must include S3Tables API support); a quick verification command is shown below
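
A quick way to confirm your CLI build includes the S3 Tables commands (a minimal check; it creates no resources):

```sh
# Print the installed AWS CLI version
aws --version

# If the s3tables command group is available this prints its help page;
# older CLI versions fail here with an "Invalid choice" error instead.
aws s3tables help
```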

## Step 1: Deploy Spark Cluster on EKS

Follow the steps below to deploy the Spark cluster on EKS:

[Spark Operator on EKS with YuniKorn Scheduler](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-yunikorn#prerequisites)

Once your cluster is up and running, proceed with the following steps to execute a sample Spark job using S3Tables.
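
Before continuing, it can help to confirm the operator pods are healthy. A minimal check, assuming the blueprint's default namespaces (`spark-operator` and `yunikorn`); configure `kubectl` first using the command shown in the blueprint's Terraform outputs:

```sh
# Confirm the Spark Operator and YuniKorn scheduler pods are running
kubectl get pods -n spark-operator
kubectl get pods -n yunikorn
```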

## Step 2: Create Test Data for the job

Navigate to the example directory and generate sample data:

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
```

This will create a file called `employee_data.csv` locally with 100 records. Modify the script to adjust the number of records as needed.
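
To sanity-check the generated file before uploading it (the exact column layout depends on the script version, so treat this as a quick inspection rather than a schema reference):

```sh
# Count the generated records and preview the first few rows
wc -l employee_data.csv
head -5 employee_data.csv
```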

## Step 3: Upload Test Input data to your S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below.

```sh
aws s3 cp employee_data.csv s3://<S3_BUCKET>/s3table-example/input/
```

## Step 4: Upload PySpark Script to S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below to upload the sample Spark job to the S3 bucket.

```sh
aws s3 cp s3table-iceberg-pyspark.py s3://<S3_BUCKET>/s3table-example/scripts/
```

## Step 5: Create S3Table

Replace `<REGION>` and `<S3TABLE_BUCKET_NAME>` with your desired values.

```sh
aws s3tables create-table-bucket \
--region "<REGION>" \
--name "<S3TABLE_BUCKET_NAME>"
```

Make a note of the S3 Table bucket ARN returned by this command; it is needed in later steps.
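
If you prefer to keep the ARN in a shell variable for the later steps, a sketch like the following should work; the `tableBuckets[].arn` and `name` fields are assumptions based on the S3Tables `list-table-buckets` output:

```sh
# Look up the ARN of the table bucket created above and export it for later steps
export S3TABLE_ARN=$(aws s3tables list-table-buckets \
  --region "<REGION>" \
  --query "tableBuckets[?name=='<S3TABLE_BUCKET_NAME>'].arn" \
  --output text)

echo "${S3TABLE_ARN}"
```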

<details>
<summary> <strong> Click here if you'd like to use Jupyter Notebook instead of Spark Operator </strong> </summary>

## Using S3Table in Jupyter Notebook

Take a look at an [example Jupyter Notebook](./s3table-iceberg-pyspark.ipynb).

#### Step 1

Forward the proxy service port to your local machine.

```bash
kubectl port-forward svc/proxy-public 8888:80 -n jupyterhub
```

#### Step 2

1. Go to [`http://localhost:8888`](http://localhost:8888). Enter any username, leave the password field empty, then click "Sign in".


![sign in](./images/sign-in.png)

1. Click Start. You can also select the upstream PySpark notebook image from the drop-down list if you'd like to customize it yourself.

![select](./images/select.png)

1. Copy examples from the [example Jupyter Notebook](./s3table-iceberg-pyspark.ipynb) to test S3 Tables features interactively.

![notebook](./images/notebook.png)


</details>

## Step 6: Update Spark Operator YAML File

- Open the `s3table-spark-operator.yaml` file in your preferred text editor.
- Replace `<S3_BUCKET>` with the S3 bucket created by this blueprint (check the Terraform outputs). This is the bucket where you copied the test data and sample Spark job in the steps above.
- Replace `<S3TABLE_ARN>` with your S3 Table ARN. If you prefer to script these substitutions, see the `sed` sketch below.
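
If you'd rather script the substitutions than hand-edit the file, a sketch using GNU `sed` (assuming you have exported `S3_BUCKET` and `S3TABLE_ARN` in your shell; on macOS use `sed -i ''`):

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables

# Use '|' as the delimiter because the ARN contains '/'
sed -i "s|<S3_BUCKET>|${S3_BUCKET}|g" s3table-spark-operator.yaml
sed -i "s|<S3TABLE_ARN>|${S3TABLE_ARN}|g" s3table-spark-operator.yaml
```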

## Step 7: Execute Spark Job

Apply the updated YAML file to your Kubernetes cluster to submit the Spark Job.

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
kubectl apply -f s3table-spark-operator.yaml
```

## Step 8: Verify the Spark Driver log for the output

Check the Spark driver logs to verify job progress and output:

```sh
kubectl logs <spark-driver-pod-name> -n spark-team-a
```
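
If you don't know the driver pod name, the Spark Operator labels driver pods with `spark-role=driver`, so a lookup like this should help:

```sh
# Find the driver pod, then follow its logs
kubectl get pods -n spark-team-a -l spark-role=driver
kubectl logs -n spark-team-a -l spark-role=driver -f
```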

## Step 9: Verify the S3Table using S3Table API

Use the S3Tables API to confirm the table was created successfully. Replace `<ACCOUNT_ID>` and run the command below.

```sh
aws s3tables get-table --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

The output looks like the following:

```json
{
  "name": "employee_s3_table",
  "type": "customer",
  "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
  "namespace": [
    "doeks_namespace"
  ],
  "versionToken": "aafc39ddd462690d2a0c",
  "metadataLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3/metadata/00004-62cc4be3-59b5-4647-a78d-1cdf69ec5ed8.metadata.json",
  "warehouseLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3",
  "createdAt": "2025-01-07T22:14:48.689581+00:00",
  "createdBy": "<ACCOUNT_ID>",
  "modifiedAt": "2025-01-09T00:06:09.222917+00:00",
  "ownerAccountId": "<ACCOUNT_ID>",
  "format": "ICEBERG"
}
```
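
If you're unsure of the namespace or table names in your table bucket, you can list them first; a sketch using the S3Tables list APIs and the `${S3TABLE_ARN}` captured in Step 5:

```sh
# List namespaces and tables in the table bucket
aws s3tables list-namespaces --table-bucket-arn ${S3TABLE_ARN}
aws s3tables list-tables --table-bucket-arn ${S3TABLE_ARN}
```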

Monitor the table maintenance job status:

```sh
aws s3tables get-table-maintenance-job-status --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

This command provides information about Iceberg compaction, snapshot management, and unreferenced file removal processes.

```json
{
  "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
  "status": {
    "icebergCompaction": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T01:18:08.857000+00:00"
    },
    "icebergSnapshotManagement": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T22:17:08.811000+00:00"
    },
    "icebergUnreferencedFileRemoval": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T22:17:10.377000+00:00"
    }
  }
}
```
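
To inspect the maintenance settings themselves (compaction target file size, snapshot retention, and so on), the S3Tables API also exposes a maintenance configuration call; a sketch:

```sh
# Show the table's maintenance configuration
aws s3tables get-table-maintenance-configuration \
  --table-bucket-arn ${S3TABLE_ARN} \
  --namespace doeks_namespace \
  --name employee_s3_table
```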

## Step 10: Clean up

Delete the table.

```bash
aws s3tables delete-table \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN} \
--name employee_s3_table
```

Delete the namespace.

```bash
aws s3tables delete-namespace \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN}
```

Finally, delete the table bucket:

```bash
aws s3tables delete-table-bucket \
--region "<REGION>" \
--table-bucket-arn ${S3TABLE_ARN}
```
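
You may also want to remove the test data and script uploaded to the general-purpose S3 bucket earlier; a sketch, assuming `S3_BUCKET` holds that bucket's name:

```bash
# Remove the sample input data and PySpark script uploaded in Steps 3 and 4
aws s3 rm s3://${S3_BUCKET}/s3table-example/ --recursive
```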

If you created a Jupyter Notebook server, delete it:

```bash
kubectl delete pods -n jupyterhub -l component=singleuser-server
```


# Conclusion
You have successfully set up and run a Spark job on Amazon EKS using S3Table for data storage. This setup provides a scalable and efficient way to process large datasets using Spark on Kubernetes with the added benefits of S3Table's data management capabilities.

For more advanced usage, refer to the official AWS documentation on S3Table and Spark on EKS.
**Please see [our website](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-s3tables) for details on how to use this example.**
12 changes: 6 additions & 6 deletions analytics/terraform/spark-k8s-operator/jupyterhub.tf
@@ -22,7 +22,7 @@ module "jupyterhub_single_user_irsa" {
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["${kubernetes_namespace.jupyterhub.metadata[0].name}:${module.eks.cluster_name}-jupyterhub-single-user"]
namespace_service_accounts = ["${kubernetes_namespace.jupyterhub[0].metadata[0].name}:${module.eks.cluster_name}-jupyterhub-single-user"]
}
}
}
@@ -31,8 +31,8 @@ resource "kubernetes_service_account_v1" "jupyterhub_single_user_sa" {
count = var.enable_jupyterhub ? 1 : 0
metadata {
name = "${module.eks.cluster_name}-jupyterhub-single-user"
namespace = kubernetes_namespace.jupyterhub.metadata[0].name
annotations = { "eks.amazonaws.com/role-arn" : module.jupyterhub_single_user_irsa.iam_role_arn }
namespace = kubernetes_namespace.jupyterhub[0].metadata[0].name
annotations = { "eks.amazonaws.com/role-arn" : module.jupyterhub_single_user_irsa[0].iam_role_arn }
}

automount_service_account_token = true
@@ -42,10 +42,10 @@ resource "kubernetes_secret_v1" "jupyterhub_single_user" {
count = var.enable_jupyterhub ? 1 : 0
metadata {
name = "${module.eks.cluster_name}-jupyterhub-single-user-secret"
namespace = kubernetes_namespace.jupyterhub.metadata[0].name
namespace = kubernetes_namespace.jupyterhub[0].metadata[0].name
annotations = {
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.jupyterhub_single_user_sa.metadata[0].name
"kubernetes.io/service-account.namespace" = kubernetes_namespace.jupyterhub.metadata[0].name
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.jupyterhub_single_user_sa[0].metadata[0].name
"kubernetes.io/service-account.namespace" = kubernetes_namespace.jupyterhub[0].metadata[0].name
}
}

59 changes: 59 additions & 0 deletions website/docs/blueprints/data-analytics/spark-operator-s3tables.md
@@ -393,6 +393,56 @@ Please note that these policies can further be adjusted and make it more granular
:::
<CollapsibleContent header={<h2><span>Using S3 Tables with JupyterLab </span></h2>}>
If you'd like to use JupyterLab to work with S3 Tables interactively, this blueprint allows you to deploy JupyterLab single user instances to your cluster.
> :warning: JupyterHub configurations available here are for testing purposes only.
>
> :warning: Review the configuration and make necessary changes to meet your security standards.
### Update Terraform variable and apply
```bash
cd ${DOEKS_HOME}/analytics/terraform/spark-k8s-operator
echo 'enable_jupyterhub = true' >> spark-operator.tfvars
terraform apply -var-file spark-operator.tfvars
```
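Once the apply completes, you can confirm the JupyterHub pods are up; the `jupyterhub` namespace matches the port-forward command used below:
```bash
kubectl get pods -n jupyterhub
```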
### Ensure you have test data available in your S3 bucket
```bash
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
aws s3 cp employee_data.csv s3://${S3_BUCKET}/s3table-example/input/
```
### Access JupyterHub UI and provision a JupyterLab server
1. Port-forward the proxy service to your local machine.
```bash
kubectl port-forward svc/proxy-public 8888:80 -n jupyterhub
```
1. Go to [`http://localhost:8888`](http://localhost:8888). Enter any username, leave the password field empty, then click "Sign in".
![sign in](./img/s3tables-jupyter-signin.png)
1. Click Start. You can also select the upstream PySpark notebook image from the drop-down list if you'd like to customize it yourself.
![select](./img/s3tables-jupyter-select.png)
1. Copy examples from the [example Jupyter Notebook](https://github.com/awslabs/data-on-eks/blob/main/analytics/terraform/spark-k8s-operator/examples/s3-tables/s3table-iceberg-pyspark.ipynb) as a starting point to test S3 Tables features interactively.
**Be sure to update `S3_BUCKET` and `s3table_arn` values in the notebook**
![notebook](./img/s3tables-jupyter-notebook.png)
</CollapsibleContent>
<CollapsibleContent header={<h2><span>Cleanup</span></h2>}>
:::caution
@@ -424,6 +474,15 @@ aws s3tables delete-table-bucket \
--table-bucket-arn ${S3TABLE_ARN}
```
## Delete the Jupyter Notebook server
If you created a Jupyter notebook server
```bash
kubectl delete pods -n jupyterhub -l component=singleuser-server
```
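## Remove the JupyterHub add-on
If you want to remove the JupyterHub add-on entirely, set `enable_jupyterhub` back to `false` (or drop the line from `spark-operator.tfvars`) and re-apply; a sketch using GNU `sed`:
```bash
cd ${DOEKS_HOME}/analytics/terraform/spark-k8s-operator
sed -i '/enable_jupyterhub/d' spark-operator.tfvars
terraform apply -var-file spark-operator.tfvars
```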
## Delete the EKS cluster
This script will clean up the environment using the `-target` option to ensure all resources are deleted in the correct order.
2 changes: 1 addition & 1 deletion website/docusaurus.config.js
@@ -9,7 +9,7 @@ const config = {
title: 'Data on EKS',
tagline: 'Supercharge your Data and AI/ML Journey with Amazon EKS',
url: 'https://awslabs.github.io',
baseUrl: '/data-on-eks/',
baseUrl: '/',
trailingSlash: false,
onBrokenLinks: 'throw',
onBrokenMarkdownLinks: 'warn',
