Terraform sync tool #290

Draft
wants to merge 53 commits into base: master

Changes from 50 commits (53 commits total)
cf9653f
inital commit
candicehou07 May 11, 2022
fbca216
create sample dataset in BigQeury
candicehou07 May 11, 2022
313418f
add cloudbuild.yaml
candicehou07 May 19, 2022
c1ccc83
table json schema
candicehou07 May 22, 2022
1a69c02
Delete bq_schema.json
candicehou07 Jun 6, 2022
feb176f
Delete BatchControlLog.json
candicehou07 Jun 6, 2022
7c42756
Merge branch 'GoogleCloudPlatform:master' into terraform_sync_tool
candicehou07 Jun 8, 2022
d94c55b
terraform sync tool
candicehou07 Jun 8, 2022
345ab63
change dir
candicehou07 Jun 8, 2022
3cfcbc2
specify dir from yaml file
candicehou07 Jun 8, 2022
8ae5b48
add step1 to run python scripts
candicehou07 Jun 8, 2022
4d5ac73
Delete auto generated contents
candicehou07 Jun 8, 2022
7c8e1d1
Update tools/terraform_sync_tool/deploy.sh output json file
candicehou07 Jun 9, 2022
8b690c3
Update tools/terraform_sync_tool/qa/terragrunt.hcl
candicehou07 Jun 9, 2022
c2fc146
Delete state.json
candicehou07 Jun 9, 2022
dd75e1f
Update schema JSON
candicehou07 Jun 9, 2022
2a354f0
Rename module to bq-setup
candicehou07 Jun 9, 2022
376b7fd
Rename JSON output file
candicehou07 Jun 9, 2022
7a33b29
Refractor python scripts
candicehou07 Jun 13, 2022
9a5fe09
Move Converting table names -> table ids in Main()
candicehou07 Jun 13, 2022
5c36be7
Accept user-provided arguments
candicehou07 Jun 13, 2022
d330bb9
Update yaml file
candicehou07 Jun 13, 2022
252dcab
Clean up yaml file
candicehou07 Jun 13, 2022
abcf98f
Update identifiers and refractor terraform_sync.py
candicehou07 Jun 13, 2022
2422a88
Add comments to get_drifted_tables()
candicehou07 Jun 13, 2022
3fca695
Add developer's TODOs: Update project ID and dataset ID
candicehou07 Jun 13, 2022
485bde1
Add README file
candicehou07 Jun 14, 2022
52467af
Update README: add folder structure section
candicehou07 Jun 14, 2022
9211697
Update README: add Cloud Build setup instruction
candicehou07 Jun 14, 2022
5dabf46
Add Comments to terraform_sync.py
candicehou07 Jun 15, 2022
e85bd21
Rename module to bigquery & provide default datasetID
candicehou07 Jun 26, 2022
073e192
Rename prefix path
candicehou07 Jun 26, 2022
f699fdd
Use provided bucket name
candicehou07 Jun 26, 2022
c89fee7
Clean out variables unused
candicehou07 Jun 26, 2022
1e65ae7
Test with mutiple tables and update python scripts to handle muti tables
candicehou07 Jun 26, 2022
ec2d139
Update python scripts - Add check resource_condensed exists condition
candicehou07 Jun 26, 2022
90cb7da
Initiate BigQuery client in main()
candicehou07 Jun 28, 2022
23f1701
Use regex to convert id_value to fully-qualified table_id
candicehou07 Jul 1, 2022
993bffa
Update Prerequisite in README
candicehou07 Jul 11, 2022
d31e25f
Update README
candicehou07 Jul 11, 2022
1e7a6fc
Derive project ID from default credentials
candicehou07 Jul 11, 2022
67505be
Update user-provided arguments description
candicehou07 Jul 11, 2022
69a2df3
Update terraform_sync.py
candicehou07 Jul 11, 2022
9cd0da7
Allow users to provide project_id to ArgumentParser
candicehou07 Jul 12, 2022
1c4fc9f
Formatting: add new line at end of files
candicehou07 Jul 18, 2022
d2e5e0e
Formatting schema json files
candicehou07 Jul 18, 2022
815bee4
Update README: provide more details on setup
candicehou07 Jul 19, 2022
2bf25ea
Update README
candicehou07 Jul 19, 2022
4e1bb73
Update cloudbuild.yaml: provide project_id
candicehou07 Jul 19, 2022
2b20ed5
Reorder and update README
candicehou07 Jul 25, 2022
7f469e4
Update README Introduction
candicehou07 Aug 30, 2022
5496102
Update README: update structure
candicehou07 Aug 30, 2022
34f1461
Update README: fix
candicehou07 Aug 30, 2022
159 changes: 159 additions & 0 deletions tools/terraform_sync_tool/README.md
@@ -0,0 +1,159 @@
# Terraform Sync Tool

This directory contains the setup for the Terraform Sync Tool. The Terraform Sync Tool is designed to address schema drifts in BigQuery tables and keep the Terraform schemas up to date with the BigQuery table schemas in the production environment. Schema drifts occur when BigQuery table schemas are updated by newly ingested data while the Terraform schema files still contain the outdated schemas. This tool detects the schema drifts, traces their origins, and alerts developers/data engineers.

Suggested change
This directory contains the setup for the Terraform Sync Tool. Terraform Sync Tool was designed to address the schema drifts in BigQuery tables and keep the
Terraform schemas up-to-date with the BigQuery table schemas in production environment. Schema drifts occurred when BigQuery Table schemas are updated by newly
ingested data while Terraform schema files contain the outdated schemas. Therefore, this tool will detect the schema drifts, trace the origins of the drifts, and alert
developers/data engineers.
This directory contains the Terraform Sync Tool. This tool intentionally fails your CI/CD pipeline when schema drifts occur between what your BigQuery Terraform resources declare and what's actually present in your BigQuery environment. These schema drifts happen when BigQuery tables are updated by processes outside of Terraform (an ETL process may dynamically add new columns when loading data into BigQuery). When drifts occur, you end up with outdated BigQuery Terraform resource files. This tool detects the schema drifts, traces the origins of the drifts, and alerts developers/data engineers (by failing the CI/CD pipeline) so they can patch the Terraform in their current commit.


updated in 7f469e4


The Terraform Schema Sync Tool fails the build attempts if resource drifts are detected and reports the latest resource information. Developers and data engineers can then update the Terraform resources accordingly.

Suggested change
The Terraform Schema Sync Tool fails the build attemps if resource drifts are detected and notifies the latest resource information. Developers and data engineers should be able to update the Terraform resources accordingly.


Updated in 7f469e4


Terraform Sync Tool can be integrated into your CI/CD pipeline. You'll need to add two steps to your CI/CD pipeline.
- Step 0: Use a Terraform/Terragrunt command to detect resource drifts and write the output into a JSON file

Suggested change
- Step 0: Use Terraform/erragrunt command to detect resource drifts and write output into a JSON file
- Step 0: Run the Terraform plan command (using either Terraform/Terragrunt) with the `-json` option and write the output into a JSON file using the shell redirection operator `> output.json`


Updated in 7f469e4

- Step 1: Use Python scripts to identify and investigate the drifts

## How to run Terraform Schema Sync Tool

Suggested change
## How to run Terraform Schema Sync Tool
## How to run Terraform Schema Sync Tool
```bash
# Using Terragrunt
terragrunt run-all plan -json --terragrunt-non-interactive > plan_output.json
python3 terraform_sync.py plan_output.json

# Using Terraform
terraform plan -json > plan_output.json
python3 terraform_sync.py plan_output.json
```

Updated in 5496102 and 34f1461


#### Use Terraform/Terragrunt commands to test if any resource drifts exist

All these should be 3rd level headers (###), not 4 (####)


Updated in 5496102


Terragrunt/Terraform commands:
```
# Terragrunt Command
terragrunt run-all plan -json --terragrunt-non-interactive

# Terraform Command
terraform plan -json
```

After running the Terraform plan command, **the event type "resource_drift" ("type": "resource_drift") indicates a drift has occurred**.
If drifts are detected, please update your Terraform configurations and address the resource drifts based on the event outputs.
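As a sketch of what that check looks like in code (the helper name is illustrative, not the tool's actual script; the event layout follows Terraform's machine-readable plan output, where each line is one JSON object):

```python
import json

def find_drifted_resources(plan_output_lines):
    """Collect resource addresses from 'resource_drift' events in
    `terraform plan -json` output (one JSON object per line)."""
    drifted = []
    for line in plan_output_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (blank lines, stray warnings)
        if event.get("type") == "resource_drift":
            # Drift events carry the drifted resource's address under change.resource.addr
            addr = event.get("change", {}).get("resource", {}).get("addr")
            if addr:
                drifted.append(addr)
    return drifted
```

An empty result means no drift and the build can proceed; any collected address is a table whose Terraform schema needs attention.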


#### Add Cloud Build Steps to your configuration file

Please check the Cloud Build steps in the `cloudbuild.yaml` file, and add these steps to your Cloud Build configuration file.

- step 0: run the Terraform commands in `deploy.sh` to detect drifts

Add `deploy.sh` to your project directory.

- step 1: run the Python scripts to investigate the Terraform output

Add `requirements.txt` and `terraform_sync.py` to your project directory.

#### (Optional if you haven't created Cloud Build Trigger) Create and configure a new Trigger in Cloud Build
Make sure to indicate your cloud configuration file location correctly.

#### That's all you need! Let's commit and test in Cloud Build!


Suggested change (deletion): remove the lines from "Terragrunt/Terraform commands:" through "That's all you need! Let's commit and test in Cloud Build!"


These are not required to run this tool. This is more related to running the example you provide with this tool.


Updated in 5496102

## How Terraform Schema Sync Tool Works

![Architecture Diagram](architecture.png)

**Executing the Sync Tool**

The Terraform Sync Tool is executed as part of the CI/CD pipeline build steps, triggered whenever developers make a change to the linked repository. A build step specifies an action that you want Cloud Build to perform. For each build step, Cloud Build executes a Docker container as an instance of `docker run`.

**Step 0: Terraform Detects Drifts**

`deploy.sh` contains the terraform plan command that writes the event outputs into the `plan_out.json` file. We'll use the `plan_out.json` file for further investigation in the later steps. Feel free to replace `plan_out.json` with your own JSON filename. We can also pass through the variables ${env} and ${tool}, if any.

**Step 1: Investigate Drifts**

`requirements.txt` specifies the Python dependencies, and `terraform_sync.py` contains the Python scripts that
investigate the Terraform event outputs stored in step 0 to detect and address schema drifts.

In the Python scripts (`terraform_sync.py`), we first scan through the output line by line to identify all the drifted tables and store their table names.
After storing the drifted table names and converting them into the table_id format [gcp_project_id].[dataset_id].[table_id], we make API calls to fetch the latest table schemas from BigQuery.
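A minimal sketch of those two stages — the regex-based id conversion and the schema fetch — might look like this (the input id format and the helper names are assumptions based on this README, not the tool's exact code):

```python
import re

def to_table_id(id_value, project_id):
    """Convert a Terraform resource id such as
    'projects/<p>/datasets/<ds>/tables/<t>' into the fully-qualified
    '<project>.<dataset>.<table>' form. Returns None if it doesn't match."""
    match = re.search(r"datasets/([^/]+)/tables/([^/]+)", id_value)
    if not match:
        return None
    return f"{project_id}.{match.group(1)}.{match.group(2)}"

def fetch_latest_schemas(table_ids, project_id):
    """Fetch the live schema for each drifted table from BigQuery."""
    # Imported here so to_table_id stays usable without GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    return {t: client.get_table(t).schema for t in table_ids}
```

`fetch_latest_schemas` requires Application Default Credentials with read access to the tables; in Cloud Build this comes from the build's service account.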

**Step 2: Fail Builds and Notify Expected Schemas**

Once the schema drifts are detected and identified, we fail the build and notify the developer who made changes to the repository. The notifications include the details and the expected schemas in order to keep the schema files up-to-date with the latest table schemas in BigQuery.

To interpret the message: the expected table schemas are in the format [{table1_id: table1_schema}, {table2_id: table2_schema}, ...], where each table_id follows the format [gcp_project_id].[dataset_id].[table_id].
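Failing the build then reduces to exiting non-zero once any expected schemas have been collected — a sketch assuming the message format above (the function name is illustrative):

```python
import sys

def fail_if_drifted(expected_schemas):
    """expected_schemas: list of {table_id: schema} dicts, where each
    table_id is '<project>.<dataset>.<table>'. A non-zero return, passed
    to sys.exit, fails the Cloud Build step and hence the pipeline."""
    if not expected_schemas:
        return 0  # no drift: let the build continue
    print("Schema drift detected. Expected schemas:")
    for entry in expected_schemas:
        print(entry)
    return 1

# In a real script: sys.exit(fail_if_drifted(collected_schemas))
```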

#### Folder Structure ####
This directory serves as a starting point for your cloud project with terraform-sync-tool integrated as one of the QA tools.

.
├── modules                  # Terraform modules directory
│   ├── bigquery             # Example Terraform BigQuery Setup
│   └── ...                  # Other modules setup you have
├── qa                       # qa environment directory
│   ├── terragrunt.hcl
│   └── terraform-sync-tool  # Tool terraform-sync-tool
│       ├── json_schemas     # Terraform schema files
│       ├── terragrunt.hcl
│       └── ...
├── cloudbuild.yaml          # Cloud Build configuration file
├── deploy.sh                # Build Step 0 - contains terragrunt commands
Comment on lines +69 to +79: All of these should go inside an example/ directory

Updated in 5496102

├── requirements.txt         # Build Step 1 - Specifies python dependencies
├── terraform_sync.py        # Build Step 1 - python scripts
└── ...                      # etc.

**What is Terragrunt?**

[Terragrunt](https://terragrunt.gruntwork.io/docs/getting-started/install) is a framework on top of Terraform that ships some new tools out of the box.
Using new *.hcl files and new keywords, you can easily share variables across Terraform modules.

**Using terragrunt to detect resource drifts**
```
terragrunt run-all plan -json --terragrunt-non-interactive

# If you need to pass variables to specify the working directory
env=VALUE_OF_ENV    # Value of env, for example "qa"
tool=VALUE_OF_TOOL  # Value of tool, for example "terraform-sync-tool"
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"

# If you need to write the outputs into a JSON file.
# Feel free to replace `plan_out.json` with your JSON filename.
terragrunt run-all plan -json --terragrunt-non-interactive > plan_out.json

# If you need to write the outputs into a JSON file with the variables specified.
# Feel free to replace `plan_out.json` with your JSON filename.
env=VALUE_OF_ENV    # Value of env, for example "qa"
tool=VALUE_OF_TOOL  # Value of tool, for example "terraform-sync-tool"
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
```

## How to run this sample repo?

#### Fork and Clone this repo

#### Go to the directory you just cloned, and update

- **YOUR_GCP_PROJECT_ID** in `./qa/terragrunt.hcl`
- **YOUR_BUCKET_NAME** in `./qa/terragrunt.hcl`
- **YOUR_DATASET_ID** in `./qa/terraform-sync-tool/terragrunt.hcl`

#### (First time only) Use terraform plan & apply to deploy your resources to your GCP Project

```
env=qa
tool=terraform-sync-tool
echo $env
echo $tool
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
terragrunt run-all apply -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
```
#### Create and configure a new Trigger in Cloud Build
Make sure to indicate your cloud configuration file location correctly.
In this sample repo, use `tools/terraform_sync_tool/cloudbuild.yaml` as your cloud configuration file location

### How to test each Build Step without triggering Cloud Build?

- To test using the terragrunt commands. Feel free to replace `plan_out.json` with your JSON filename and change the values of the variables.
```
env=qa
tool=terraform-sync-tool
echo $env
echo $tool
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
```

- To test using terragrunt commands without writing the output into a JSON file
```
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
```

- To test the Python scripts. `terraform_sync.py` requires two arguments: the JSON filename and the gcp_project_id. Provide the arguments to test `terraform_sync.py`. Feel free to replace `plan_out.json` with your JSON filename.
```
python terraform_sync.py plan_out.json <YOUR_GCP_PROJECT_ID>
```
Binary file added tools/terraform_sync_tool/architecture.png
15 changes: 15 additions & 0 deletions tools/terraform_sync_tool/cloudbuild.yaml
@@ -0,0 +1,15 @@
steps:
  # step 0: run terraform commands in deploy.sh to detect drifts
  - name: 'alpine/terragrunt'
    entrypoint: 'bash'
    dir: './tools/terraform_sync_tool/'
    args: ['deploy.sh', 'qa', 'terraform-sync-tool']

  # step 1: run python scripts to investigate terraform output
  - name: python:3.7
    entrypoint: 'bash'
    dir: './tools/terraform_sync_tool/'
    args:
      - -c
      - |
        pip install -r ./requirements.txt
        python terraform_sync.py plan_out.json <GCP_PROJECT_ID>
6 changes: 6 additions & 0 deletions tools/terraform_sync_tool/deploy.sh
@@ -0,0 +1,6 @@
#!/bin/bash

env=$1
tool=$2

terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
55 changes: 55 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/main.tf
@@ -0,0 +1,55 @@
locals {
  datasets = { for dataset in var.datasets : dataset["dataset_id"] => dataset }
  tables   = { for table in var.tables : table["table_id"] => table }

  iam_to_primitive = {
    "roles/bigquery.dataOwner" : "OWNER"
    "roles/bigquery.dataEditor" : "WRITER"
    "roles/bigquery.dataViewer" : "READER"
  }
}

# Create the BigQuery datasets from the provided list
resource "google_bigquery_dataset" "bq_dataset" {
  for_each      = local.datasets
  friendly_name = each.value["friendly_name"]
  dataset_id    = each.key
  location      = each.value["location"]
  project       = var.project_id
}

resource "google_bigquery_table" "bq_table" {
  for_each            = local.tables
  dataset_id          = each.value["dataset_id"]
  friendly_name       = each.key
  table_id            = each.key
  labels              = each.value["labels"]
  schema              = file(each.value["schema"])
  clustering          = each.value["clustering"]
  expiration_time     = each.value["expiration_time"]
  project             = var.project_id
  deletion_protection = each.value["deletion_protection"]
  depends_on          = [google_bigquery_dataset.bq_dataset]

  dynamic "time_partitioning" {
    for_each = each.value["time_partitioning"] != null ? [each.value["time_partitioning"]] : []
    content {
      type                     = time_partitioning.value["type"]
      expiration_ms            = time_partitioning.value["expiration_ms"]
      field                    = time_partitioning.value["field"]
      require_partition_filter = time_partitioning.value["require_partition_filter"]
    }
  }

  dynamic "range_partitioning" {
    for_each = each.value["range_partitioning"] != null ? [each.value["range_partitioning"]] : []
    content {
      field = range_partitioning.value["field"]
      range {
        start    = range_partitioning.value["range"].start
        end      = range_partitioning.value["range"].end
        interval = range_partitioning.value["range"].interval
      }
    }
  }
}
56 changes: 56 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/variables.tf
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
variable "location" {
  description = "The regional location for the dataset; only US and EU are allowed in this module"
  type        = string
  default     = "US"
}

variable "deletion_protection" {
  description = "Whether or not to allow Terraform to destroy the instance. Unless this field is set to false in Terraform state, a terraform destroy or terraform apply that would delete the instance will fail."
  type        = bool
  default     = true
}

variable "project_id" {
  description = "Project where the dataset and table are created"
  type        = string
}

variable "datasets" {
  description = "A list of objects which include dataset_id, friendly_name and location."
  default     = []
  type = list(object({
    dataset_id    = string
    friendly_name = string
    location      = string
  }))
}

variable "tables" {
  description = "A list of objects which include table_id, schema, clustering, time_partitioning, expiration_time and labels."
  default     = []
  type = list(object({
    table_id            = string,
    dataset_id          = string, # added to support creating tables across multiple datasets
    schema              = string,
    clustering          = list(string),
    deletion_protection = bool,
    time_partitioning = object({
      expiration_ms            = string,
      field                    = string,
      type                     = string,
      require_partition_filter = bool,
    }),
    range_partitioning = object({
      field = string,
      range = object({
        start    = string,
        end      = string,
        interval = string,
      }),
    }),
    expiration_time = string,
    labels          = map(string),
  }))
}
@@ -0,0 +1,62 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
},
{
"description": "Col5",
"mode": "NULLABLE",
"name": "Col5",
"type": "STRING"
},
{
"description": "Col6",
"mode": "NULLABLE",
"name": "Col6",
"type": "STRING"
},
{
"description": "Col7",
"mode": "NULLABLE",
"name": "Col7",
"type": "STRING"
},
{
"description": "Col8",
"mode": "NULLABLE",
"name": "Col8",
"type": "STRING"
},
{
"description": "Col9",
"mode": "NULLABLE",
"name": "Col9",
"type": "STRING"
},
{
"description": "Col10",
"mode": "NULLABLE",
"name": "Col10",
"type": "STRING"
}
]
@@ -0,0 +1,26 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
}
]