Terraform sync tool #290

Draft
wants to merge 53 commits into base: master

Changes from 50 commits (53 commits total)
cf9653f
inital commit
candicehou07 May 11, 2022
fbca216
create sample dataset in BigQeury
candicehou07 May 11, 2022
313418f
add cloudbuild.yaml
candicehou07 May 19, 2022
c1ccc83
table json schema
candicehou07 May 22, 2022
1a69c02
Delete bq_schema.json
candicehou07 Jun 6, 2022
feb176f
Delete BatchControlLog.json
candicehou07 Jun 6, 2022
7c42756
Merge branch 'GoogleCloudPlatform:master' into terraform_sync_tool
candicehou07 Jun 8, 2022
d94c55b
terraform sync tool
candicehou07 Jun 8, 2022
345ab63
change dir
candicehou07 Jun 8, 2022
3cfcbc2
specify dir from yaml file
candicehou07 Jun 8, 2022
8ae5b48
add step1 to run python scripts
candicehou07 Jun 8, 2022
4d5ac73
Delete auto generated contents
candicehou07 Jun 8, 2022
7c8e1d1
Update tools/terraform_sync_tool/deploy.sh output json file
candicehou07 Jun 9, 2022
8b690c3
Update tools/terraform_sync_tool/qa/terragrunt.hcl
candicehou07 Jun 9, 2022
c2fc146
Delete state.json
candicehou07 Jun 9, 2022
dd75e1f
Update schema JSON
candicehou07 Jun 9, 2022
2a354f0
Rename module to bq-setup
candicehou07 Jun 9, 2022
376b7fd
Rename JSON output file
candicehou07 Jun 9, 2022
7a33b29
Refractor python scripts
candicehou07 Jun 13, 2022
9a5fe09
Move Converting table names -> table ids in Main()
candicehou07 Jun 13, 2022
5c36be7
Accept user-provided arguments
candicehou07 Jun 13, 2022
d330bb9
Update yaml file
candicehou07 Jun 13, 2022
252dcab
Clean up yaml file
candicehou07 Jun 13, 2022
abcf98f
Update identifiers and refractor terraform_sync.py
candicehou07 Jun 13, 2022
2422a88
Add comments to get_drifted_tables()
candicehou07 Jun 13, 2022
3fca695
Add developer's TODOs: Update project ID and dataset ID
candicehou07 Jun 13, 2022
485bde1
Add README file
candicehou07 Jun 14, 2022
52467af
Update README: add folder structure section
candicehou07 Jun 14, 2022
9211697
Update README: add Cloud Build setup instruction
candicehou07 Jun 14, 2022
5dabf46
Add Comments to terraform_sync.py
candicehou07 Jun 15, 2022
e85bd21
Rename module to bigquery & provide default datasetID
candicehou07 Jun 26, 2022
073e192
Rename prefix path
candicehou07 Jun 26, 2022
f699fdd
Use provided bucket name
candicehou07 Jun 26, 2022
c89fee7
Clean out variables unused
candicehou07 Jun 26, 2022
1e65ae7
Test with mutiple tables and update python scripts to handle muti tables
candicehou07 Jun 26, 2022
ec2d139
Update python scripts - Add check resource_condensed exists condition
candicehou07 Jun 26, 2022
90cb7da
Initiate BigQuery client in main()
candicehou07 Jun 28, 2022
23f1701
Use regex to convert id_value to fully-qualified table_id
candicehou07 Jul 1, 2022
993bffa
Update Prerequisite in README
candicehou07 Jul 11, 2022
d31e25f
Update README
candicehou07 Jul 11, 2022
1e7a6fc
Derive project ID from default credentials
candicehou07 Jul 11, 2022
67505be
Update user-provided arguments description
candicehou07 Jul 11, 2022
69a2df3
Update terraform_sync.py
candicehou07 Jul 11, 2022
9cd0da7
Allow users to provide project_id to ArgumentParser
candicehou07 Jul 12, 2022
1c4fc9f
Formatting: add new line at end of files
candicehou07 Jul 18, 2022
d2e5e0e
Formatting schema json files
candicehou07 Jul 18, 2022
815bee4
Update README: provide more details on setup
candicehou07 Jul 19, 2022
2bf25ea
Update README
candicehou07 Jul 19, 2022
4e1bb73
Update cloudbuild.yaml: provide project_id
candicehou07 Jul 19, 2022
2b20ed5
Reorder and update README
candicehou07 Jul 25, 2022
7f469e4
Update README Introduction
candicehou07 Aug 30, 2022
5496102
Update README: update structure
candicehou07 Aug 30, 2022
34f1461
Update README: fix
candicehou07 Aug 30, 2022
159 changes: 159 additions & 0 deletions tools/terraform_sync_tool/README.md
@@ -0,0 +1,159 @@
# Terraform Sync Tool

This directory contains the setup for the Terraform Sync Tool. The Terraform Sync Tool is designed to address schema drifts in BigQuery tables and keep the Terraform schemas up to date with the BigQuery table schemas in the production environment. Schema drifts occur when BigQuery table schemas are updated by newly ingested data while the Terraform schema files still contain the outdated schemas. This tool detects the schema drifts, traces their origins, and alerts developers/data engineers.

Suggested change
This directory contains the setup for the Terraform Sync Tool. Terraform Sync Tool was designed to address the schema drifts in BigQuery tables and keep the
Terraform schemas up-to-date with the BigQuery table schemas in production environment. Schema drifts occurred when BigQuery Table schemas are updated by newly
ingested data while Terraform schema files contain the outdated schemas. Therefore, this tool will detect the schema drifts, trace the origins of the drifts, and alert
developers/data engineers.
This directory contains the Terraform Sync Tool. This tool intentionally fails your CI/CD pipeline when schema drifts occur between what your BigQuery Terraform resources declare and what's actually present in your BigQuery environment. These schema drifts happen when BigQuery tables are updated by processes outside of Terraform (an ETL process may dynamically add new columns when loading data into BigQuery). When drifts occur, you end up with outdated BigQuery Terraform resource files. This tool detects the schema drifts, traces the origins of the drifts, and alerts developers/data engineers (by failing the CI/CD pipeline) so they can patch the Terraform in their current commit.


updated in 7f469e4


The Terraform Schema Sync Tool fails the build attempts if resource drifts are detected and reports the latest resource information. Developers and data engineers can then update the Terraform resources accordingly.

Suggested change
The Terraform Schema Sync Tool fails the build attemps if resource drifts are detected and notifies the latest resource information. Developers and data engineers should be able to update the Terraform resources accordingly.


Updated in 7f469e4


Terraform Sync Tool can be integrated into your CI/CD pipeline. You'll need to add two steps to your CI/CD pipeline.
- Step 0: Use a Terraform/Terragrunt command to detect resource drifts and write the output into a JSON file

Suggested change
- Step 0: Use Terraform/erragrunt command to detect resource drifts and write output into a JSON file
- Step 0: Run the Terraform plan command (using either Terraform/Terragrunt) with the `-json` option and write the output into a JSON file using the shell redirection operator `> output.json`


Updated in 7f469e4

- Step 1: Use Python scripts to identify and investigate the drifts

## How to run Terraform Schema Sync Tool

Suggested change
## How to run Terraform Schema Sync Tool
## How to run Terraform Schema Sync Tool
```bash
# Using Terragrunt
terragrunt run-all plan -json --terragrunt-non-interactive > plan_output.json
python3 terraform_sync.py plan_output.json

# Using Terraform
terraform plan -json > plan_output.json
python3 terraform_sync.py plan_output.json
```

Updated in 5496102 and 34f1461


#### Use Terraform/Terragrunt commands to test if any resource drifts exist

All these should be 3rd level headers (###), not 4 (####)


Updated in 5496102


Terragrunt/Terraform commands:
```
# Terragrunt Command
terragrunt run-all plan -json --terragrunt-non-interactive

# Terraform Command
terraform plan -json
```

After running the Terraform plan command, **the event type "resource_drift" ("type": "resource_drift") indicates a drift has occurred**.
If drifts are detected, please update your Terraform configurations and address the resource drifts based on the event outputs.
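As a sketch of what that check looks like in code (the helper name is illustrative, not the tool's actual script; the event layout follows Terraform's machine-readable plan output, where each line is one JSON object):

```python
import json

def find_drifted_resources(plan_output_lines):
    """Collect resource addresses from 'resource_drift' events in
    `terraform plan -json` output (one JSON object per line)."""
    drifted = []
    for line in plan_output_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (blank lines, stray warnings)
        if event.get("type") == "resource_drift":
            # Drift events carry the drifted resource's address under change.resource.addr
            addr = event.get("change", {}).get("resource", {}).get("addr")
            if addr:
                drifted.append(addr)
    return drifted
```

An empty result means no drift and the build can proceed; any collected address is a table whose Terraform schema needs attention.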


#### Add Cloud Build Steps to your configuration file

Please check the Cloud Build steps in the `cloudbuild.yaml` file, and add these steps to your Cloud Build configuration file.

- step 0: run the Terraform commands in `deploy.sh` to detect drifts

Add `deploy.sh` to your project directory.

- step 1: run the Python scripts to investigate the Terraform output

Add `requirements.txt` and `terraform_sync.py` to your project directory.

#### (Optional if you haven't created Cloud Build Trigger) Create and configure a new Trigger in Cloud Build
Make sure to indicate your cloud configuration file location correctly.

#### That's all you need! Let's commit and test in Cloud Build!


Suggested change (deletion): remove the lines from "Terragrunt/Terraform commands:" through "That's all you need! Let's commit and test in Cloud Build!"


These are not required to run this tool. This is more related to running the example you provide with this tool.


Updated in 5496102

## How Terraform Schema Sync Tool Works

![Architecture Diagram](architecture.png)

**Executing the Sync Tool**

The Terraform Sync Tool is executed as part of the CI/CD pipeline build steps, triggered whenever developers make a change to the linked repository. A build step specifies an action that you want Cloud Build to perform. For each build step, Cloud Build executes a Docker container as an instance of `docker run`.

**Step 0: Terraform Detects Drifts**

`deploy.sh` contains the terraform plan command that writes the event outputs into the `plan_out.json` file. We'll use the `plan_out.json` file for further investigation in the later steps. Feel free to replace `plan_out.json` with your own JSON filename. We can also pass through the variables ${env} and ${tool}, if any.

**Step 1: Investigate Drifts**

`requirements.txt` specifies the Python dependencies, and `terraform_sync.py` contains the Python scripts that
investigate the Terraform event outputs stored in step 0 to detect and address schema drifts.

In the Python scripts (`terraform_sync.py`), we first scan through the output line by line to identify all the drifted tables and store their table names.
After storing the drifted table names and converting them into the table_id format [gcp_project_id].[dataset_id].[table_id], we make API calls to fetch the latest table schemas from BigQuery.
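A minimal sketch of those two stages — the regex-based id conversion and the schema fetch — might look like this (the input id format and the helper names are assumptions based on this README, not the tool's exact code):

```python
import re

def to_table_id(id_value, project_id):
    """Convert a Terraform resource id such as
    'projects/<p>/datasets/<ds>/tables/<t>' into the fully-qualified
    '<project>.<dataset>.<table>' form. Returns None if it doesn't match."""
    match = re.search(r"datasets/([^/]+)/tables/([^/]+)", id_value)
    if not match:
        return None
    return f"{project_id}.{match.group(1)}.{match.group(2)}"

def fetch_latest_schemas(table_ids, project_id):
    """Fetch the live schema for each drifted table from BigQuery."""
    # Imported here so to_table_id stays usable without GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    return {t: client.get_table(t).schema for t in table_ids}
```

`fetch_latest_schemas` requires Application Default Credentials with read access to the tables; in Cloud Build this comes from the build's service account.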

**Step 2: Fail Builds and Notify Expected Schemas**

Once the schema drifts are detected and identified, we fail the build and notify the developer who made changes to the repository. The notifications include the details and the expected schemas in order to keep the schema files up-to-date with the latest table schemas in BigQuery.

To interpret the message: the expected table schemas are in the format [{table1_id: table1_schema}, {table2_id: table2_schema}, ...], where each table_id follows the format [gcp_project_id].[dataset_id].[table_id].
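Failing the build then reduces to exiting non-zero once any expected schemas have been collected — a sketch assuming the message format above (the function name is illustrative):

```python
import sys

def fail_if_drifted(expected_schemas):
    """expected_schemas: list of {table_id: schema} dicts, where each
    table_id is '<project>.<dataset>.<table>'. A non-zero return, passed
    to sys.exit, fails the Cloud Build step and hence the pipeline."""
    if not expected_schemas:
        return 0  # no drift: let the build continue
    print("Schema drift detected. Expected schemas:")
    for entry in expected_schemas:
        print(entry)
    return 1

# In a real script: sys.exit(fail_if_drifted(collected_schemas))
```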

#### Folder Structure ####
This directory serves as a starting point for your cloud project with terraform-sync-tool integrated as one of the QA tools.

.
├── modules                  # Terraform modules directory
│   ├── bigquery             # Example Terraform BigQuery Setup
│   └── ...                  # Other modules setup you have
├── qa                       # qa environment directory
│   ├── terragrunt.hcl
│   └── terraform-sync-tool  # Tool terraform-sync-tool
│       ├── json_schemas     # Terraform schema files
│       ├── terragrunt.hcl
│       └── ...
├── cloudbuild.yaml          # Cloud Build configuration file
├── deploy.sh                # Build Step 0 - contains terragrunt commands
Comment on lines +69 to +79: All of these should go inside an example/ directory

Updated in 5496102

├── requirements.txt         # Build Step 1 - Specifies python dependencies
├── terraform_sync.py        # Build Step 1 - python scripts
└── ...                      # etc.

**What is Terragrunt?**

[Terragrunt](https://terragrunt.gruntwork.io/docs/getting-started/install) is a framework on top of Terraform that ships some new tools out of the box.
Using new *.hcl files and new keywords, you can easily share variables across Terraform modules.

**Using terragrunt to detect resource drifts**
```
terragrunt run-all plan -json --terragrunt-non-interactive

# If you need to pass variables to specify the working directory
env=VALUE_OF_ENV    # Value of env, for example "qa"
tool=VALUE_OF_TOOL  # Value of tool, for example "terraform-sync-tool"
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"

# If you need to write the outputs into a JSON file.
# Feel free to replace `plan_out.json` with your JSON filename.
terragrunt run-all plan -json --terragrunt-non-interactive > plan_out.json

# If you need to write the outputs into a JSON file with the variables specified.
# Feel free to replace `plan_out.json` with your JSON filename.
env=VALUE_OF_ENV    # Value of env, for example "qa"
tool=VALUE_OF_TOOL  # Value of tool, for example "terraform-sync-tool"
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
```

## How to run this sample repo?

#### Fork and Clone this repo

#### Go to the directory you just cloned, and update

- **YOUR_GCP_PROJECT_ID** in `./qa/terragrunt.hcl`
- **YOUR_BUCKET_NAME** in `./qa/terragrunt.hcl`
- **YOUR_DATASET_ID** in `./qa/terraform-sync-tool/terragrunt.hcl`

#### (First time only) Use terraform plan & apply to deploy your resources to your GCP Project

```
env=qa
tool=terraform-sync-tool
echo $env
echo $tool
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
terragrunt run-all apply -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
```
#### Create and configure a new Trigger in Cloud Build
Make sure to indicate your cloud configuration file location correctly.
In this sample repo, use `tools/terraform_sync_tool/cloudbuild.yaml` as your cloud configuration file location

### How to test each Build Step without triggering Cloud Build?

- To test using the terragrunt commands. Feel free to replace `plan_out.json` with your JSON filename and change the values of the variables.
```
env=qa
tool=terraform-sync-tool
echo $env
echo $tool
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
```

- To test using terragrunt commands without writing the output into a JSON file
```
terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}"
```

- To test the Python scripts. `terraform_sync.py` requires two arguments: the JSON filename and the gcp_project_id. Provide the arguments to test `terraform_sync.py`. Feel free to replace `plan_out.json` with your JSON filename.
```
python terraform_sync.py plan_out.json <YOUR_GCP_PROJECT_ID>
```
Binary file added tools/terraform_sync_tool/architecture.png
15 changes: 15 additions & 0 deletions tools/terraform_sync_tool/cloudbuild.yaml
@@ -0,0 +1,15 @@
steps:
  # step 0: run terraform commands in deploy.sh to detect drifts
  - name: 'alpine/terragrunt'
    entrypoint: 'bash'
    dir: './tools/terraform_sync_tool/'
    args: ['deploy.sh', 'qa', 'terraform-sync-tool']

  # step 1: run python scripts to investigate terraform output
  - name: python:3.7
    entrypoint: 'bash'
    dir: './tools/terraform_sync_tool/'
    args:
      - -c
      - |
        pip install -r ./requirements.txt
        python terraform_sync.py plan_out.json <GCP_PROJECT_ID>
6 changes: 6 additions & 0 deletions tools/terraform_sync_tool/deploy.sh
@@ -0,0 +1,6 @@
#!/bin/bash

env=$1
tool=$2

terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
55 changes: 55 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/main.tf
@@ -0,0 +1,55 @@
locals {
  datasets = { for dataset in var.datasets : dataset["dataset_id"] => dataset }
  tables   = { for table in var.tables : table["table_id"] => table }

  iam_to_primitive = {
    "roles/bigquery.dataOwner" : "OWNER"
    "roles/bigquery.dataEditor" : "WRITER"
    "roles/bigquery.dataViewer" : "READER"
  }
}

# Create the BigQuery datasets from the provided list
resource "google_bigquery_dataset" "bq_dataset" {
  for_each      = local.datasets
  friendly_name = each.value["friendly_name"]
  dataset_id    = each.key
  location      = each.value["location"]
  project       = var.project_id
}

resource "google_bigquery_table" "bq_table" {
  for_each            = local.tables
  dataset_id          = each.value["dataset_id"]
  friendly_name       = each.key
  table_id            = each.key
  labels              = each.value["labels"]
  schema              = file(each.value["schema"])
  clustering          = each.value["clustering"]
  expiration_time     = each.value["expiration_time"]
  project             = var.project_id
  deletion_protection = each.value["deletion_protection"]
  depends_on          = [google_bigquery_dataset.bq_dataset]

  dynamic "time_partitioning" {
    for_each = each.value["time_partitioning"] != null ? [each.value["time_partitioning"]] : []
    content {
      type                     = time_partitioning.value["type"]
      expiration_ms            = time_partitioning.value["expiration_ms"]
      field                    = time_partitioning.value["field"]
      require_partition_filter = time_partitioning.value["require_partition_filter"]
    }
  }

  dynamic "range_partitioning" {
    for_each = each.value["range_partitioning"] != null ? [each.value["range_partitioning"]] : []
    content {
      field = range_partitioning.value["field"]
      range {
        start    = range_partitioning.value["range"].start
        end      = range_partitioning.value["range"].end
        interval = range_partitioning.value["range"].interval
      }
    }
  }
}
56 changes: 56 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/variables.tf
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
variable "location" {
  description = "The regional location for the dataset; only US and EU are allowed in this module"
  type        = string
  default     = "US"
}

variable "deletion_protection" {
  description = "Whether or not to allow Terraform to destroy the instance. Unless this field is set to false in Terraform state, a terraform destroy or terraform apply that would delete the instance will fail."
  type        = bool
  default     = true
}

variable "project_id" {
  description = "Project where the dataset and table are created"
  type        = string
}

variable "datasets" {
  description = "A list of objects which include dataset_id, friendly_name and location."
  default     = []
  type = list(object({
    dataset_id    = string
    friendly_name = string
    location      = string
  }))
}

variable "tables" {
  description = "A list of objects which include table_id, schema, clustering, time_partitioning, expiration_time and labels."
  default     = []
  type = list(object({
    table_id            = string,
    dataset_id          = string, # added to support creating tables across multiple datasets
    schema              = string,
    clustering          = list(string),
    deletion_protection = bool,
    time_partitioning = object({
      expiration_ms            = string,
      field                    = string,
      type                     = string,
      require_partition_filter = bool,
    }),
    range_partitioning = object({
      field = string,
      range = object({
        start    = string,
        end      = string,
        interval = string,
      }),
    }),
    expiration_time = string,
    labels          = map(string),
  }))
}
@@ -0,0 +1,62 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
},
{
"description": "Col5",
"mode": "NULLABLE",
"name": "Col5",
"type": "STRING"
},
{
"description": "Col6",
"mode": "NULLABLE",
"name": "Col6",
"type": "STRING"
},
{
"description": "Col7",
"mode": "NULLABLE",
"name": "Col7",
"type": "STRING"
},
{
"description": "Col8",
"mode": "NULLABLE",
"name": "Col8",
"type": "STRING"
},
{
"description": "Col9",
"mode": "NULLABLE",
"name": "Col9",
"type": "STRING"
},
{
"description": "Col10",
"mode": "NULLABLE",
"name": "Col10",
"type": "STRING"
}
]
@@ -0,0 +1,26 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
}
]