From 0d09b9fc8095ba9dcc9bd6d80db2e6ac72f11f6b Mon Sep 17 00:00:00 2001 From: Ratnopam Charabarti Date: Tue, 5 Nov 2024 00:08:36 -0600 Subject: [PATCH] docs: Update Spark Benchmark Docs (#698) --- ai-ml/emr-spark-rapids/README.md | 2 +- ai-ml/nvidia-triton-server/README.md | 4 +- analytics/terraform/datahub-on-eks/README.md | 4 +- analytics/terraform/emr-eks-ack/README.md | 4 +- analytics/terraform/emr-eks-fargate/README.md | 4 +- .../terraform/emr-eks-karpenter/README.md | 2 +- .../terraform/spark-k8s-operator/README.md | 10 ++-- .../terraform/spark-k8s-operator/variables.tf | 2 +- schedulers/terraform/argo-workflow/README.md | 8 ++-- schedulers/terraform/aws-batch-eks/README.md | 6 +-- .../terraform/self-managed-airflow/README.md | 10 ++-- streaming/flink/README.md | 4 +- streaming/kafka/README.md | 8 ++-- streaming/nifi/README.md | 4 +- streaming/spark-streaming/terraform/README.md | 8 ++-- .../data-generation.md | 1 - .../spark-operator-eks-benchmark.md | 46 ++++++++++++++++++- 17 files changed, 86 insertions(+), 41 deletions(-) diff --git a/ai-ml/emr-spark-rapids/README.md b/ai-ml/emr-spark-rapids/README.md index e693a626d..bb91a9799 100644 --- a/ai-ml/emr-spark-rapids/README.md +++ b/ai-ml/emr-spark-rapids/README.md @@ -61,7 +61,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [enable\_nvidia\_gpu\_operator](#input\_enable\_nvidia\_gpu\_operator) | Enable NVIDIA GPU Operator | `bool` | `false` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-spark-rapids"` | no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no | diff --git a/ai-ml/nvidia-triton-server/README.md b/ai-ml/nvidia-triton-server/README.md index b86c32977..66b726b02 100644 --- a/ai-ml/nvidia-triton-server/README.md +++ b/ai-ml/nvidia-triton-server/README.md @@ -79,9 +79,9 @@ | [huggingface\_token](#input\_huggingface\_token) | Hugging Face Secret Token | `string` | `"DUMMY_TOKEN_REPLACE_ME"` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"nvidia-triton-server"` | no | | [ngc\_api\_key](#input\_ngc\_api\_key) | NGC API Key | `string` | `"DUMMY_NGC_API_KEY_REPLACE_ME"` | no | -| [nim\_models](#input\_nim\_models) | NVIDIA NIM Models |
list(object({
name = string
id = string
enable = bool
num_gpu = string
}))
|
[
{
"enable": false,
"id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
"name": "llama-3-1-8b-instruct",
"num_gpu": "4"
},
{
"enable": true,
"id": "nvcr.io/nim/meta/llama3-8b-instruct",
"name": "llama3-8b-instruct",
"num_gpu": "1"
}
]
| no | +| [nim\_models](#input\_nim\_models) | NVIDIA NIM Models |
list(object({
name = string
id = string
enable = bool
num_gpu = string
}))
|
[
{
"enable": false,
"id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
"name": "llama-3-1-8b-instruct",
"num_gpu": "4"
},
{
"enable": true,
"id": "nvcr.io/nim/meta/llama3-8b-instruct",
"name": "llama3-8b-instruct",
"num_gpu": "1"
}
]
| no | | [region](#input\_region) | region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no | ## Outputs diff --git a/analytics/terraform/datahub-on-eks/README.md b/analytics/terraform/datahub-on-eks/README.md index f749fbb28..fd1d95814 100644 --- a/analytics/terraform/datahub-on-eks/README.md +++ b/analytics/terraform/datahub-on-eks/README.md @@ -46,8 +46,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"datahub-on-eks"` | no | | [private\_subnet\_ids](#input\_private\_subnet\_ids) | Ids for existing private subnets - needed when create\_vpc set to false | `list(string)` | `[]` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR - must change to match the cidr of the existing VPC if create\_vpc set to false | `string` | `"10.1.0.0/16"` | no | diff --git a/analytics/terraform/emr-eks-ack/README.md b/analytics/terraform/emr-eks-ack/README.md index e5308048a..05b4872ed 100644 --- a/analytics/terraform/emr-eks-ack/README.md +++ b/analytics/terraform/emr-eks-ack/README.md @@ -54,8 +54,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-ack"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no | diff --git a/analytics/terraform/emr-eks-fargate/README.md b/analytics/terraform/emr-eks-fargate/README.md index 2d6a8aa25..dae1b4eae 100644 --- a/analytics/terraform/emr-eks-fargate/README.md +++ b/analytics/terraform/emr-eks-fargate/README.md @@ -49,8 +49,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-fargate"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no | diff --git a/analytics/terraform/emr-eks-karpenter/README.md b/analytics/terraform/emr-eks-karpenter/README.md index ded52b607..b8e7166fd 100644 --- a/analytics/terraform/emr-eks-karpenter/README.md +++ b/analytics/terraform/emr-eks-karpenter/README.md @@ -89,7 +89,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `false` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-karpenter"` | no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no | diff --git a/analytics/terraform/spark-k8s-operator/README.md b/analytics/terraform/spark-k8s-operator/README.md index bf4e03370..0e143e2cc 100644 --- a/analytics/terraform/spark-k8s-operator/README.md +++ b/analytics/terraform/spark-k8s-operator/README.md @@ -72,16 +72,18 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.30"` | no | -| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | +| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | | [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no | | [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no | | [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `true` | no | | [kms\_key\_admin\_roles](#input\_kms\_key\_admin\_roles) | list of role ARNs to add to the KMS policy | `list(string)` | `[]` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"spark-operator-doeks"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [spark\_benchmark\_ssd\_desired\_size](#input\_spark\_benchmark\_ssd\_desired\_size) | Desired size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark | `number` | `0` | no | +| [spark\_benchmark\_ssd\_min\_size](#input\_spark\_benchmark\_ssd\_min\_size) | Minimum size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark | `number` | `0` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/16"` | no | ## Outputs diff --git a/analytics/terraform/spark-k8s-operator/variables.tf b/analytics/terraform/spark-k8s-operator/variables.tf index be787f4da..ce47e40d6 100644 --- a/analytics/terraform/spark-k8s-operator/variables.tf +++ b/analytics/terraform/spark-k8s-operator/variables.tf @@ -88,4 +88,4 @@ variable "spark_benchmark_ssd_desired_size" { description = "Desired size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark" type = number default = 0 -} \ No newline at end of file +} diff --git a/schedulers/terraform/argo-workflow/README.md b/schedulers/terraform/argo-workflow/README.md index ebb39c7c2..6119fccdd 100644 --- a/schedulers/terraform/argo-workflow/README.md +++ b/schedulers/terraform/argo-workflow/README.md @@ -78,15 +78,15 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.29"` | no | -| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | +| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | | [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no | | [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no | | [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `true` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"doeks-spark-argo"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/16"` | no | ## Outputs diff --git a/schedulers/terraform/aws-batch-eks/README.md b/schedulers/terraform/aws-batch-eks/README.md index 3a066e501..49fedd1cf 100644 --- a/schedulers/terraform/aws-batch-eks/README.md +++ b/schedulers/terraform/aws-batch-eks/README.md @@ -58,7 +58,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [aws\_batch\_doeks\_jd\_name](#input\_aws\_batch\_doeks\_jd\_name) | The AWS Batch example job definition name | `string` | `"doeks-hello-world"` | no | | [aws\_batch\_doeks\_jq\_name](#input\_aws\_batch\_doeks\_jq\_name) | The AWS Batch EKS namespace | `string` | `"doeks-JQ1"` | no | | [aws\_batch\_doeks\_namespace](#input\_aws\_batch\_doeks\_namespace) | The AWS Batch EKS namespace | `string` | `"doeks-aws-batch"` | no | -| [aws\_batch\_instance\_types](#input\_aws\_batch\_instance\_types) | The set of instance types to launch for AWS Batch jobs. | `list(string)` |
[
"optimal"
]
| no | +| [aws\_batch\_instance\_types](#input\_aws\_batch\_instance\_types) | The set of instance types to launch for AWS Batch jobs. | `list(string)` |
[
"optimal"
]
| no | | [aws\_batch\_max\_vcpus](#input\_aws\_batch\_max\_vcpus) | The minimum aggregate vCPU for AWS Batch compute environment | `number` | `256` | no | | [aws\_batch\_min\_vcpus](#input\_aws\_batch\_min\_vcpus) | The minimum aggregate vCPU for AWS Batch compute environment | `number` | `0` | no | | [aws\_region](#input\_aws\_region) | AWS Region | `string` | `"us-east-1"` | no | @@ -67,8 +67,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [eks\_private\_cluster\_endpoint](#input\_eks\_private\_cluster\_endpoint) | Whether to have a private cluster endpoint for the EKS cluster. | `bool` | `true` | no | | [eks\_public\_cluster\_endpoint](#input\_eks\_public\_cluster\_endpoint) | Whether to have a public cluster endpoint for the EKS cluster. #WARNING: Avoid a public endpoint in preprod or prod accounts. This feature is designed for sandbox accounts, simplifying cluster deployment and testing. | `bool` | `true` | no | | [num\_azs](#input\_num\_azs) | The number of Availability Zones to deploy subnets to. Must be 2 or more | `number` | `2` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | | [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no | diff --git a/schedulers/terraform/self-managed-airflow/README.md b/schedulers/terraform/self-managed-airflow/README.md index 5d7260551..c0f229deb 100644 --- a/schedulers/terraform/self-managed-airflow/README.md +++ b/schedulers/terraform/self-managed-airflow/README.md @@ -99,17 +99,17 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| -| [db\_private\_subnets](#input\_db\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Airflow DB. | `list(string)` |
[
"10.0.20.0/26",
"10.0.21.0/26"
]
| no | +| [db\_private\_subnets](#input\_db\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Airflow DB. | `list(string)` |
[
"10.0.20.0/26",
"10.0.21.0/26"
]
| no | | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.29"` | no | -| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | +| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | | [enable\_airflow](#input\_enable\_airflow) | Enable Apache Airflow | `bool` | `true` | no | | [enable\_airflow\_spark\_example](#input\_enable\_airflow\_spark\_example) | Enable Apache Airflow and Spark Operator example | `bool` | `false` | no | | [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"self-managed-airflow"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.0.1.0/24",
"10.0.2.0/24"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.0.0.0/26",
"10.0.0.64/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.0.1.0/24",
"10.0.2.0/24"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.0.0.0/26",
"10.0.0.64/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.0.0.0/16"` | no | ## Outputs diff --git a/streaming/flink/README.md b/streaming/flink/README.md index fb2673697..876d564fb 100755 --- a/streaming/flink/README.md +++ b/streaming/flink/README.md @@ -73,8 +73,8 @@ | [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no | | [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `true` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"flink-operator-doeks"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` |
[
"10.1.0.0/17",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` |
[
"10.1.255.128/26",
"10.1.255.192/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no | diff --git a/streaming/kafka/README.md b/streaming/kafka/README.md index f1deaafb1..56fc5cf55 100644 --- a/streaming/kafka/README.md +++ b/streaming/kafka/README.md @@ -59,13 +59,13 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.31"` | no | -| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | +| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | | [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"kafka-on-eks"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/16"` | no | ## Outputs diff --git a/streaming/nifi/README.md b/streaming/nifi/README.md index c1894ab32..25ce367da 100644 --- a/streaming/nifi/README.md +++ b/streaming/nifi/README.md @@ -82,8 +82,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"nifi-on-eks"` | no | | [nifi\_sub\_domain](#input\_nifi\_sub\_domain) | Subdomain for NiFi cluster. | `string` | `"mynifi"` | no | | [nifi\_username](#input\_nifi\_username) | NiFi login username | `string` | `"admin"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 16382 IPs per Subnet | `list(string)` |
[
"10.1.0.0/18",
"10.1.64.0/18",
"10.1.128.0/18"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 4094 IPs per Subnet | `list(string)` |
[
"10.1.192.0/20",
"10.1.208.0/20",
"10.1.224.0/20"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 16382 IPs per Subnet | `list(string)` |
[
"10.1.0.0/18",
"10.1.64.0/18",
"10.1.128.0/18"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 4094 IPs per Subnet | `list(string)` |
[
"10.1.192.0/20",
"10.1.208.0/20",
"10.1.224.0/20"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no | diff --git a/streaming/spark-streaming/terraform/README.md b/streaming/spark-streaming/terraform/README.md index 9dd02dfbf..d60b6dd78 100644 --- a/streaming/spark-streaming/terraform/README.md +++ b/streaming/spark-streaming/terraform/README.md @@ -70,15 +70,15 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/ | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| | [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.29"` | no | -| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | +| [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` |
[
"100.64.0.0/17",
"100.64.128.0/17"
]
| no | | [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no | | [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no | | [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `false` | no | | [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"spark-streaming-doeks"` | no | -| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | -| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | +| [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` |
[
"10.1.1.0/24",
"10.1.2.0/24"
]
| no | +| [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` |
[
"10.1.0.0/26",
"10.1.0.64/26"
]
| no | | [region](#input\_region) | Region | `string` | `"us-west-2"` | no | -| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
[
"100.64.0.0/16"
]
| no | | [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/16"` | no | ## Outputs diff --git a/website/docs/benchmarks/spark-operator-benchmark/data-generation.md b/website/docs/benchmarks/spark-operator-benchmark/data-generation.md index e228fc204..f494de143 100644 --- a/website/docs/benchmarks/spark-operator-benchmark/data-generation.md +++ b/website/docs/benchmarks/spark-operator-benchmark/data-generation.md @@ -67,4 +67,3 @@ echo $S3_BUCKET ``` - diff --git a/website/docs/benchmarks/spark-operator-benchmark/spark-operator-eks-benchmark.md b/website/docs/benchmarks/spark-operator-benchmark/spark-operator-eks-benchmark.md index 9c2485b9b..50b8b47a0 100644 --- a/website/docs/benchmarks/spark-operator-benchmark/spark-operator-eks-benchmark.md +++ b/website/docs/benchmarks/spark-operator-benchmark/spark-operator-eks-benchmark.md @@ -13,4 +13,48 @@ Key Features 📈 - Benchmark Results - Customizable Benchmarks to suit your workloads - Autoscaling and Cost Optimization Strategies - \ No newline at end of file + +## 📊 TPC-DS Benchmark for Spark + +The TPC-DS benchmark is an industry-standard benchmark created by the Transaction Processing Performance Council (TPC) to evaluate the performance of decision support systems. It is widely used to assess the efficiency of SQL-based data processing engines in handling complex analytical queries. Databricks has optimized TPC-DS for Apache Spark, making it highly relevant for evaluating Spark workloads in cloud and on-premise environments. + +The TPC-DS benchmark tests cover a range of complex, data-intensive queries that simulate realistic decision support systems, making it ideal for assessing Spark's performance with structured and semi-structured data. 
+
+### 📈 Some Details on TPC-DS Benchmarks for Spark
+
+The TPC-DS benchmark for Spark on Databricks includes the following features and considerations:
+
+- Dataset Size: TPC-DS provides standardized datasets ranging from a few gigabytes to multiple terabytes, allowing users to test Spark's scalability.
+- Query Suite: The benchmark includes 99 queries that simulate real-world decision support scenarios. These queries test Spark's ability to handle joins, aggregations, and complex nested queries efficiently.
+- Optimization Techniques: Databricks has implemented various Spark-specific optimizations to improve TPC-DS query performance, such as the Catalyst Optimizer, Adaptive Query Execution (AQE), and optimized shuffle operations.
+- Execution Environment: The benchmark tests typically run in a distributed environment on a cloud platform (such as Amazon EKS), leveraging an optimized runtime for Apache Spark.
+
+**Note:** The results of TPC-DS benchmarks on Spark can vary significantly depending on cluster configurations, instance types, and Spark settings.
+
+### ✨ Why We Need the Benchmark Tests
+
+Benchmark tests are crucial for the following reasons:
+
+- Performance Evaluation: They provide an objective measure to evaluate and compare the speed, efficiency, and scalability of Spark engines on various cloud and hardware configurations.
+
+- Optimization Insights: By analyzing benchmark results, users can identify specific areas where Spark can be tuned to handle large-scale, complex analytical queries.
+
+- Infrastructure Comparison: Running TPC-DS on Spark on Amazon EKS allows teams to make data-driven decisions on the best platforms for their data workloads.
+
+- Operational Readiness: They ensure that the Spark setup is ready to handle large datasets and complex analytical tasks, reducing risks in production environments.
+
+### 🎁 What We Offer
+
+With Data-on-EKS blueprints, we offer:
+
+- Ready-to-deploy Infrastructure-as-Code (Terraform) templates
+- Benchmark Test Data Generation Toolkit
+- Benchmark Execution Toolkit
+
+### 💰 A Note on Cost
+
+To minimize costs, we recommend terminating the `C5d` instances once the benchmark data generation toolkit completes execution and the data has been successfully stored in the S3 bucket. This keeps resources in use only while they are needed and helps keep expenses under control.
+
+## 🔗 Additional Resources
+
+[TPC-DS Specification](https://www.tpc.org/tpcds/default5.asp)