move documentation to website
Signed-off-by: Manabu McCloskey <[email protected]>
nabuskey committed Jan 14, 2025
1 parent b0910d2 commit cf78d2d
Showing 8 changed files with 69 additions and 213 deletions.
2 changes: 1 addition & 1 deletion analytics/terraform/spark-k8s-operator/addons.tf
@@ -430,7 +430,7 @@ module "eks_data_addons" {
enable_jupyterhub = var.enable_jupyterhub
jupyterhub_helm_config = {
values = [templatefile("${path.module}/helm-values/jupyterhub-singleuser-values.yaml", {
jupyter_single_user_sa_name = kubernetes_service_account_v1.jupyterhub_single_user_sa.metadata[0].name
jupyter_single_user_sa_name = "${var.enable_jupyterhub ? kubernetes_service_account_v1.jupyterhub_single_user_sa[0].metadata[0].name : "no-tused"}"
})]
version = "3.3.8"
}
207 changes: 2 additions & 205 deletions analytics/terraform/spark-k8s-operator/examples/s3-tables/README.md
@@ -1,206 +1,3 @@
# S3Table with OSS Spark on EKS Guide
# S3Table with OSS Spark on EKS

This guide provides step-by-step instructions for setting up and running a Spark job on Amazon EKS using S3Table for data storage.

## Prerequisites

- Latest version of the AWS CLI installed (must include S3Tables API support); a quick verification command is shown below
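
A quick way to confirm your CLI build includes the S3 Tables commands (a minimal check; it creates no resources):

```sh
# Print the installed AWS CLI version
aws --version

# If the s3tables command group is available this prints its help page;
# older CLI versions fail here with an "Invalid choice" error instead.
aws s3tables help
```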

## Step 1: Deploy Spark Cluster on EKS

Follow the steps below to deploy the Spark cluster on EKS:

[Spark Operator on EKS with YuniKorn Scheduler](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-yunikorn#prerequisites)

Once your cluster is up and running, proceed with the following steps to execute a sample Spark job using S3Tables.
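
Before continuing, it can help to confirm the operator pods are healthy. A minimal check, assuming the blueprint's default namespaces (`spark-operator` and `yunikorn`); configure `kubectl` first using the command shown in the blueprint's Terraform outputs:

```sh
# Confirm the Spark Operator and YuniKorn scheduler pods are running
kubectl get pods -n spark-operator
kubectl get pods -n yunikorn
```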

## Step 2: Create Test Data for the job

Navigate to the example directory and generate sample data:

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
```

This will create a file called `employee_data.csv` locally with 100 records. Modify the script to adjust the number of records as needed.
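
To sanity-check the generated file before uploading it (the exact column layout depends on the script version, so treat this as a quick inspection rather than a schema reference):

```sh
# Count the generated records and preview the first few rows
wc -l employee_data.csv
head -5 employee_data.csv
```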

## Step 3: Upload Test Input data to your S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below.

```sh
aws s3 cp employee_data.csv s3://<S3_BUCKET>/s3table-example/input/
```

## Step 4: Upload PySpark Script to S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below to upload the sample Spark job to the S3 bucket.

```sh
aws s3 cp s3table-iceberg-pyspark.py s3://<S3_BUCKET>/s3table-example/scripts/
```

## Step 5: Create S3Table

Replace `<REGION>` and `<S3TABLE_BUCKET_NAME>` with your desired values.

```sh
aws s3tables create-table-bucket \
--region "<REGION>" \
--name "<S3TABLE_BUCKET_NAME>"
```

Make a note of the S3 Table bucket ARN returned by this command; it is needed in later steps.
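
If you prefer to keep the ARN in a shell variable for the later steps, a sketch like the following should work; the `tableBuckets[].arn` and `name` fields are assumptions based on the S3Tables `list-table-buckets` output:

```sh
# Look up the ARN of the table bucket created above and export it for later steps
export S3TABLE_ARN=$(aws s3tables list-table-buckets \
  --region "<REGION>" \
  --query "tableBuckets[?name=='<S3TABLE_BUCKET_NAME>'].arn" \
  --output text)

echo "${S3TABLE_ARN}"
```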

<details>
<summary> <strong> Click here if you'd like to use Jupyter Notebook instead of Spark Operator </strong> </summary>

## Using S3Table in Jupyter Notebook

Take a look at an [example Jupyter Notebook](./s3table-iceberg-pyspark.ipynb).

#### Step 1

Forward the proxy service port to your local machine.

```bash
kubectl port-forward svc/proxy-public 8888:80 -n jupyterhub
```

#### Step 2

1. Go to [`http://localhost:8888`](http://localhost:8888). Enter any username, leave the password field empty, then click "Sign in".


![sign in](./images/sign-in.png)

1. Click Start. You can also select the upstream PySpark notebook image from the drop-down list if you'd like to customize it yourself.

![select](./images/select.png)

1. Copy examples from the [example Jupyter Notebook](./s3table-iceberg-pyspark.ipynb) to test S3 Tables features interactively.

![notebook](./images/notebook.png)


</details>

## Step 6: Update Spark Operator YAML File

- Open the `s3table-spark-operator.yaml` file in your preferred text editor.
- Replace `<S3_BUCKET>` with the S3 bucket created by this blueprint (check the Terraform outputs). This is the bucket where you copied the test data and sample Spark job in the steps above.
- Replace `<S3TABLE_ARN>` with your S3 Table ARN. If you prefer to script these substitutions, see the `sed` sketch below.
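
If you'd rather script the substitutions than hand-edit the file, a sketch using GNU `sed` (assuming you have exported `S3_BUCKET` and `S3TABLE_ARN` in your shell; on macOS use `sed -i ''`):

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables

# Use '|' as the delimiter because the ARN contains '/'
sed -i "s|<S3_BUCKET>|${S3_BUCKET}|g" s3table-spark-operator.yaml
sed -i "s|<S3TABLE_ARN>|${S3TABLE_ARN}|g" s3table-spark-operator.yaml
```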

## Step 7: Execute Spark Job

Apply the updated YAML file to your Kubernetes cluster to submit the Spark Job.

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
kubectl apply -f s3table-spark-operator.yaml
```

## Step 8: Verify the Spark Driver log for the output

Check the Spark driver logs to verify job progress and output:

```sh
kubectl logs <spark-driver-pod-name> -n spark-team-a
```
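
If you don't know the driver pod name, the Spark Operator labels driver pods with `spark-role=driver`, so a lookup like this should help:

```sh
# Find the driver pod, then follow its logs
kubectl get pods -n spark-team-a -l spark-role=driver
kubectl logs -n spark-team-a -l spark-role=driver -f
```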

## Step 9: Verify the S3Table using S3Table API

Use the S3Tables API to confirm the table was created successfully. Replace `<ACCOUNT_ID>` and run the command below.

```sh
aws s3tables get-table --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

The output looks like the following:

```json
{
  "name": "employee_s3_table",
  "type": "customer",
  "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
  "namespace": [
    "doeks_namespace"
  ],
  "versionToken": "aafc39ddd462690d2a0c",
  "metadataLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3/metadata/00004-62cc4be3-59b5-4647-a78d-1cdf69ec5ed8.metadata.json",
  "warehouseLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3",
  "createdAt": "2025-01-07T22:14:48.689581+00:00",
  "createdBy": "<ACCOUNT_ID>",
  "modifiedAt": "2025-01-09T00:06:09.222917+00:00",
  "ownerAccountId": "<ACCOUNT_ID>",
  "format": "ICEBERG"
}
```
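
If you're unsure of the namespace or table names in your table bucket, you can list them first; a sketch using the S3Tables list APIs and the `${S3TABLE_ARN}` captured in Step 5:

```sh
# List namespaces and tables in the table bucket
aws s3tables list-namespaces --table-bucket-arn ${S3TABLE_ARN}
aws s3tables list-tables --table-bucket-arn ${S3TABLE_ARN}
```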

Monitor the table maintenance job status:

```sh
aws s3tables get-table-maintenance-job-status --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

This command provides information about Iceberg compaction, snapshot management, and unreferenced file removal processes.

```json
{
  "tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
  "status": {
    "icebergCompaction": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T01:18:08.857000+00:00"
    },
    "icebergSnapshotManagement": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T22:17:08.811000+00:00"
    },
    "icebergUnreferencedFileRemoval": {
      "status": "Successful",
      "lastRunTimestamp": "2025-01-08T22:17:10.377000+00:00"
    }
  }
}
```
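
To inspect the maintenance settings themselves (compaction target file size, snapshot retention, and so on), the S3Tables API also exposes a maintenance configuration call; a sketch:

```sh
# Show the table's maintenance configuration
aws s3tables get-table-maintenance-configuration \
  --table-bucket-arn ${S3TABLE_ARN} \
  --namespace doeks_namespace \
  --name employee_s3_table
```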

## Step 10: Clean up

Delete the table.

```bash
aws s3tables delete-table \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN} \
--name employee_s3_table
```

Delete the namespace.

```bash
aws s3tables delete-namespace \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN}
```

Finally, delete the table bucket:

```bash
aws s3tables delete-table-bucket \
--region "<REGION>" \
--table-bucket-arn ${S3TABLE_ARN}
```
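
You may also want to remove the test data and script uploaded to the general-purpose S3 bucket earlier; a sketch, assuming `S3_BUCKET` holds that bucket's name:

```bash
# Remove the sample input data and PySpark script uploaded in Steps 3 and 4
aws s3 rm s3://${S3_BUCKET}/s3table-example/ --recursive
```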

If you created a Jupyter Notebook server, delete it:

```bash
kubectl delete pods -n jupyterhub -l component=singleuser-server
```


# Conclusion
You have successfully set up and run a Spark job on Amazon EKS using S3Table for data storage. This setup provides a scalable and efficient way to process large datasets using Spark on Kubernetes with the added benefits of S3Table's data management capabilities.

For more advanced usage, refer to the official AWS documentation on S3Table and Spark on EKS.
**Please see [our website](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-s3tables) for details on how to use this example.**
12 changes: 6 additions & 6 deletions analytics/terraform/spark-k8s-operator/jupyterhub.tf
@@ -22,7 +22,7 @@ module "jupyterhub_single_user_irsa" {
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["${kubernetes_namespace.jupyterhub.metadata[0].name}:${module.eks.cluster_name}-jupyterhub-single-user"]
namespace_service_accounts = ["${kubernetes_namespace.jupyterhub[0].metadata[0].name}:${module.eks.cluster_name}-jupyterhub-single-user"]
}
}
}
@@ -31,8 +31,8 @@ resource "kubernetes_service_account_v1" "jupyterhub_single_user_sa" {
count = var.enable_jupyterhub ? 1 : 0
metadata {
name = "${module.eks.cluster_name}-jupyterhub-single-user"
namespace = kubernetes_namespace.jupyterhub.metadata[0].name
annotations = { "eks.amazonaws.com/role-arn" : module.jupyterhub_single_user_irsa.iam_role_arn }
namespace = kubernetes_namespace.jupyterhub[0].metadata[0].name
annotations = { "eks.amazonaws.com/role-arn" : module.jupyterhub_single_user_irsa[0].iam_role_arn }
}

automount_service_account_token = true
@@ -42,10 +42,10 @@ resource "kubernetes_secret_v1" "jupyterhub_single_user" {
count = var.enable_jupyterhub ? 1 : 0
metadata {
name = "${module.eks.cluster_name}-jupyterhub-single-user-secret"
namespace = kubernetes_namespace.jupyterhub.metadata[0].name
namespace = kubernetes_namespace.jupyterhub[0].metadata[0].name
annotations = {
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.jupyterhub_single_user_sa.metadata[0].name
"kubernetes.io/service-account.namespace" = kubernetes_namespace.jupyterhub.metadata[0].name
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.jupyterhub_single_user_sa[0].metadata[0].name
"kubernetes.io/service-account.namespace" = kubernetes_namespace.jupyterhub[0].metadata[0].name
}
}

59 changes: 59 additions & 0 deletions website/docs/blueprints/data-analytics/spark-operator-s3tables.md
@@ -393,6 +393,56 @@ Please note that these policies can further be adjusted and make it more granular
:::
<CollapsibleContent header={<h2><span>Using S3 Tables with JupyterLab </span></h2>}>
If you'd like to use JupyterLab to work with S3 Tables interactively, this blueprint allows you to deploy JupyterLab single user instances to your cluster.
> :warning: JupyterHub configurations available here are for testing purposes only.
>
> :warning: Review the configuration and make necessary changes to meet your security standards.
### Update Terraform variable and apply
```bash
cd ${DOEKS_HOME}/analytics/terraform/spark-k8s-operator
echo 'enable_jupyterhub = true' >> spark-operator.tfvars
terraform apply -var-file spark-operator.tfvars
```
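Once the apply completes, you can confirm the JupyterHub pods are up; the `jupyterhub` namespace matches the port-forward command used below:
```bash
kubectl get pods -n jupyterhub
```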
### Ensure you have test data available in your S3 bucket
```bash
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
aws s3 cp employee_data.csv s3://${S3_BUCKET}/s3table-example/input/
```
### Access JupyterHub UI and provision a JupyterLab server
1. Port-forward the proxy service to your local machine.
```bash
kubectl port-forward svc/proxy-public 8888:80 -n jupyterhub
```
1. Go to [`http://localhost:8888`](http://localhost:8888). Enter any username, leave the password field empty, then click "Sign in".
![sign in](./img/s3tables-jupyter-signin.png)
1. Click Start. You can also select the upstream PySpark notebook image from the drop-down list if you'd like to customize it yourself.
![select](./img/s3tables-jupyter-select.png)
1. Copy examples from the [example Jupyter Notebook](https://github.com/awslabs/data-on-eks/blob/main/analytics/terraform/spark-k8s-operator/examples/s3-tables/s3table-iceberg-pyspark.ipynb) as a starting point to test S3 Tables features interactively.
**Be sure to update `S3_BUCKET` and `s3table_arn` values in the notebook**
![notebook](./img/s3tables-jupyter-notebook.png)
</CollapsibleContent>
<CollapsibleContent header={<h2><span>Cleanup</span></h2>}>
:::caution
@@ -424,6 +474,15 @@ aws s3tables delete-table-bucket \
--table-bucket-arn ${S3TABLE_ARN}
```
## Delete the Jupyter Notebook server
If you created a Jupyter notebook server
```bash
kubectl delete pods -n jupyterhub -l component=singleuser-server
```
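## Remove the JupyterHub add-on
If you want to remove the JupyterHub add-on entirely, set `enable_jupyterhub` back to `false` (or drop the line from `spark-operator.tfvars`) and re-apply; a sketch using GNU `sed`:
```bash
cd ${DOEKS_HOME}/analytics/terraform/spark-k8s-operator
sed -i '/enable_jupyterhub/d' spark-operator.tfvars
terraform apply -var-file spark-operator.tfvars
```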
## Delete the EKS cluster
This script will clean up the environment using the `-target` option to ensure all resources are deleted in the correct order.
2 changes: 1 addition & 1 deletion website/docusaurus.config.js
@@ -9,7 +9,7 @@ const config = {
title: 'Data on EKS',
tagline: 'Supercharge your Data and AI/ML Journey with Amazon EKS',
url: 'https://awslabs.github.io',
baseUrl: '/data-on-eks/',
baseUrl: '/',
trailingSlash: false,
onBrokenLinks: 'throw',
onBrokenMarkdownLinks: 'warn',
