
Terraform sync tool #290

Draft: wants to merge 53 commits into base: master

Commits (53)
cf9653f
inital commit
candicehou07 May 11, 2022
fbca216
create sample dataset in BigQeury
candicehou07 May 11, 2022
313418f
add cloudbuild.yaml
candicehou07 May 19, 2022
c1ccc83
table json schema
candicehou07 May 22, 2022
1a69c02
Delete bq_schema.json
candicehou07 Jun 6, 2022
feb176f
Delete BatchControlLog.json
candicehou07 Jun 6, 2022
7c42756
Merge branch 'GoogleCloudPlatform:master' into terraform_sync_tool
candicehou07 Jun 8, 2022
d94c55b
terraform sync tool
candicehou07 Jun 8, 2022
345ab63
change dir
candicehou07 Jun 8, 2022
3cfcbc2
specify dir from yaml file
candicehou07 Jun 8, 2022
8ae5b48
add step1 to run python scripts
candicehou07 Jun 8, 2022
4d5ac73
Delete auto generated contents
candicehou07 Jun 8, 2022
7c8e1d1
Update tools/terraform_sync_tool/deploy.sh output json file
candicehou07 Jun 9, 2022
8b690c3
Update tools/terraform_sync_tool/qa/terragrunt.hcl
candicehou07 Jun 9, 2022
c2fc146
Delete state.json
candicehou07 Jun 9, 2022
dd75e1f
Update schema JSON
candicehou07 Jun 9, 2022
2a354f0
Rename module to bq-setup
candicehou07 Jun 9, 2022
376b7fd
Rename JSON output file
candicehou07 Jun 9, 2022
7a33b29
Refractor python scripts
candicehou07 Jun 13, 2022
9a5fe09
Move Converting table names -> table ids in Main()
candicehou07 Jun 13, 2022
5c36be7
Accept user-provided arguments
candicehou07 Jun 13, 2022
d330bb9
Update yaml file
candicehou07 Jun 13, 2022
252dcab
Clean up yaml file
candicehou07 Jun 13, 2022
abcf98f
Update identifiers and refractor terraform_sync.py
candicehou07 Jun 13, 2022
2422a88
Add comments to get_drifted_tables()
candicehou07 Jun 13, 2022
3fca695
Add developer's TODOs: Update project ID and dataset ID
candicehou07 Jun 13, 2022
485bde1
Add README file
candicehou07 Jun 14, 2022
52467af
Update README: add folder structure section
candicehou07 Jun 14, 2022
9211697
Update README: add Cloud Build setup instruction
candicehou07 Jun 14, 2022
5dabf46
Add Comments to terraform_sync.py
candicehou07 Jun 15, 2022
e85bd21
Rename module to bigquery & provide default datasetID
candicehou07 Jun 26, 2022
073e192
Rename prefix path
candicehou07 Jun 26, 2022
f699fdd
Use provided bucket name
candicehou07 Jun 26, 2022
c89fee7
Clean out variables unused
candicehou07 Jun 26, 2022
1e65ae7
Test with mutiple tables and update python scripts to handle muti tables
candicehou07 Jun 26, 2022
ec2d139
Update python scripts - Add check resource_condensed exists condition
candicehou07 Jun 26, 2022
90cb7da
Initiate BigQuery client in main()
candicehou07 Jun 28, 2022
23f1701
Use regex to convert id_value to fully-qualified table_id
candicehou07 Jul 1, 2022
993bffa
Update Prerequisite in README
candicehou07 Jul 11, 2022
d31e25f
Update README
candicehou07 Jul 11, 2022
1e7a6fc
Derive project ID from default credentials
candicehou07 Jul 11, 2022
67505be
Update user-provided arguments description
candicehou07 Jul 11, 2022
69a2df3
Update terraform_sync.py
candicehou07 Jul 11, 2022
9cd0da7
Allow users to provide project_id to ArgumentParser
candicehou07 Jul 12, 2022
1c4fc9f
Formatting: add new line at end of files
candicehou07 Jul 18, 2022
d2e5e0e
Formatting schema json files
candicehou07 Jul 18, 2022
815bee4
Update README: provide more details on setup
candicehou07 Jul 19, 2022
2bf25ea
Update README
candicehou07 Jul 19, 2022
4e1bb73
Update cloudbuild.yaml: provide project_id
candicehou07 Jul 19, 2022
2b20ed5
Reorder and update README
candicehou07 Jul 25, 2022
7f469e4
Update README Introduction
candicehou07 Aug 30, 2022
5496102
Update README: update structure
candicehou07 Aug 30, 2022
34f1461
Update README: fix
candicehou07 Aug 30, 2022
119 changes: 119 additions & 0 deletions tools/terraform_sync_tool/README.md
@@ -0,0 +1,119 @@
# Terraform Sync Tool

This directory contains the Terraform Sync Tool. This tool intentionally fails your CI/CD pipeline when schema drift occurs between
what your BigQuery Terraform resources declare and what's actually present in your BigQuery environment.
These schema drifts happen when BigQuery tables are updated by processes outside of Terraform (for example, an ETL process may dynamically add new columns when loading data into BigQuery).
When drifts occur, you end up with outdated BigQuery Terraform resource files. This tool detects the schema drifts,
traces their origins, and alerts developers/data engineers (by failing the CI/CD pipeline)
so they can patch the Terraform in their current commit.


The Terraform Sync Tool can be integrated into your CI/CD pipeline. You'll need to add two steps to your pipeline:
- Step 0: Run the Terraform plan command (using either Terraform or Terragrunt) with the `-json` option and redirect the output into a JSON file using the `>` operator, e.g. `> output.json`
- Step 1: Run the Python script to identify and investigate the drifts

## How to run Terraform Schema Sync Tool

```bash
###############
# Using Terragrunt
###############
terragrunt run-all plan -json --terragrunt-non-interactive > plan_output.json
python3 terraform_sync.py plan_output.json <YOUR_GCP_PROJECT_ID>
##############
# Using Terraform
##############
terraform plan -json > plan_output.json
python3 terraform_sync.py plan_output.json <YOUR_GCP_PROJECT_ID>
```

## How Terraform Schema Sync Tool Works

![Architecture Diagram](architecture.png)

**Executing the Sync Tool**

The Terraform Sync Tool is executed as part of the CI/CD pipeline build steps, triggered whenever developers make a change to the linked repository. A build step specifies an action that you want Cloud Build to perform. For each build step, Cloud Build executes a Docker container, similar to an instance of `docker run`.

**Step 0: Terraform Detects Drifts**

`deploy.sh` contains the Terraform plan command, which writes the event output into the `plan_out.json` file. We'll use `plan_out.json` for further investigation in later steps. Feel free to replace `plan_out.json` with your own JSON filename. You can also pass through the `${env}` and `${tool}` variables if needed.

**Step 1: Investigate Drifts**

`requirements.txt` specifies the Python dependencies, and `terraform_sync.py` contains the Python script that
investigates the Terraform event output stored in Step 0 to detect and address schema drifts.

In the Python script (`terraform_sync.py`), we first scan the output line by line to identify all the drifted tables and store their table names.
After storing the drifted table names and converting them into the fully-qualified table ID format `[gcp_project_id].[dataset_id].[table_id]`, we make API calls to fetch the latest table schemas from BigQuery.
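
As a rough illustration of that flow, here is a minimal Python sketch (the function names, the `dataset_id` argument, and the exact event fields are assumptions based on Terraform's streamed JSON plan format, not the tool's actual code):

```python
import json
import re

from google.cloud import bigquery


def get_drifted_table_ids(plan_output_path, project_id, dataset_id):
    """Scan the JSON plan output line by line and collect drifted table IDs."""
    table_ids = []
    with open(plan_output_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines in the output
            if event.get("type") != "resource_drift":
                continue
            # e.g. addr = 'google_bigquery_table.bq_table["TableForTest"]'
            addr = event.get("change", {}).get("resource", {}).get("addr", "")
            match = re.search(r'\["(?P<table>[^"]+)"\]$', addr)
            if match:
                # Convert to the fully-qualified [project].[dataset].[table] format
                table_ids.append(f"{project_id}.{dataset_id}.{match.group('table')}")
    return table_ids


def fetch_latest_schemas(table_ids):
    """Fetch the current schema of each drifted table from BigQuery."""
    client = bigquery.Client()  # uses application default credentials
    return {table_id: client.get_table(table_id).schema for table_id in table_ids}
```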

**Step 2: Fail Builds and Notify Expected Schemas**

Once the schema drifts are detected and identified, we fail the build and notify the developer who made the changes to the repository. The notification includes the drift details and the expected schemas, so the schema files can be kept up to date with the latest table schemas in BigQuery.

To interpret the message: the expected table schemas are in the format `[{table1_id: table1_schema}, {table2_id: table2_schema}, ...]`, where each table ID follows the format `[gcp_project_id].[dataset_id].[table_id]`.
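
A hypothetical sketch of that failure path (assuming `schemas` maps fully-qualified table IDs to lists of `bigquery.SchemaField` objects, as returned by the sketch above):

```python
import sys


def fail_build(schemas):
    """Print the expected schemas for each drifted table and fail the build."""
    expected = [
        {table_id: [field.to_api_repr() for field in fields]}
        for table_id, fields in schemas.items()
    ]
    print(f"Schema drift detected. Expected schemas: {expected}", file=sys.stderr)
    sys.exit(1)  # a non-zero exit code fails the Cloud Build step
```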

**What is Terragrunt?**

[Terragrunt](https://terragrunt.gruntwork.io/docs/getting-started/install) is a thin wrapper around Terraform that provides extra tools out of the box.
Using `*.hcl` files and new keywords, you can easily share variables across Terraform modules.

## How to run this sample repo?

#### Fork and Clone this repo

#### Folder Structure
This directory serves as a starting point for your cloud project, with terraform-sync-tool integrated as one of the QA tools.

```
.
├── modules                     # Terraform modules directory
│   ├── bigquery                # Example Terraform BigQuery setup
│   └── ...                     # Other modules you have
├── qa                          # QA environment directory
│   ├── terragrunt.hcl
│   └── terraform-sync-tool     # Tool terraform-sync-tool
│       ├── json_schemas        # Terraform schema files
│       ├── terragrunt.hcl
│       └── ...
├── cloudbuild.yaml             # Cloud Build configuration file
├── deploy.sh                   # Build Step 0 - contains terragrunt commands
├── requirements.txt            # Build Step 1 - specifies Python dependencies
├── terraform_sync.py           # Build Step 1 - Python script
└── ...                         # etc.
```

#### Go to the directory you just cloned, and update

- **YOUR_GCP_PROJECT_ID** in `./qa/terragrunt.hcl`
- **YOUR_BUCKET_NAME** in `./qa/terragrunt.hcl`
- **YOUR_DATASET_ID** in `./qa/terraform-sync-tool/terragrunt.hcl`

#### Use Terraform/Terragrunt commands to test whether any resource drifts exist

Terragrunt/Terraform commands:
```
# Terragrunt command
terragrunt run-all plan -json --terragrunt-non-interactive

# Terraform command
terraform plan -json
```

After running the Terraform plan command, **an event of type `resource_drift` (`"type": "resource_drift"`) indicates a drift has occurred**.
If drifts are detected, update your Terraform configurations to address the resource drifts based on the event output.
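
For a quick manual check of a saved plan output, a few lines of Python along these lines (a sketch, assuming the streamed JSON event format described above) will surface the drift events:

```python
import json

with open("plan_output.json") as f:
    for line in f:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip any non-JSON lines
        if event.get("type") == "resource_drift":
            print(event.get("@message", event))  # summary of the drifted resource
```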


#### Add Cloud Build steps to your configuration file

Check the build steps in the `cloudbuild.yaml` file, and add these steps to your Cloud Build configuration file.

- Step 0: run the Terraform commands in `deploy.sh` to detect drifts.

  Add `deploy.sh` to your project directory.

- Step 1: run the Python script to investigate the Terraform output.

  Add `requirements.txt` and `terraform_sync.py` to your project directory.

#### (Optional if you haven't created a Cloud Build trigger) Create and configure a new trigger in Cloud Build
Make sure to specify your Cloud Build configuration file location correctly. In this sample repo, use `tools/terraform_sync_tool/cloudbuild.yaml` as your configuration file location.

#### That's all you need! Let's commit and test in Cloud Build!
Binary file added tools/terraform_sync_tool/architecture.png
15 changes: 15 additions & 0 deletions tools/terraform_sync_tool/cloudbuild.yaml
@@ -0,0 +1,15 @@
steps:
# step 0: run terraform commands in deploy.sh to detect drifts
- name: 'alpine/terragrunt'
entrypoint: 'bash'
dir: './tools/terraform_sync_tool/'
args: ['deploy.sh', 'qa', 'terraform-sync-tool']

# step 1: run python scripts to investigate terraform output
- name: python:3.7
entrypoint: 'bash'
dir: './tools/terraform_sync_tool/'
args:
- -c
# chain the commands so both run in the single `bash -c` invocation
- 'pip install -r ./requirements.txt && python terraform_sync.py plan_out.json <GCP_PROJECT_ID>'
6 changes: 6 additions & 0 deletions tools/terraform_sync_tool/deploy.sh
@@ -0,0 +1,6 @@
#!/bin/bash
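# Run the terragrunt plan for the given environment/tool directory and write
# the streamed JSON events to plan_out.json, which build Step 1 investigates.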

env=$1
tool=$2

terragrunt run-all plan -json --terragrunt-non-interactive --terragrunt-working-dir="${env}"/"${tool}" > plan_out.json
55 changes: 55 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/main.tf
@@ -0,0 +1,55 @@
locals {
datasets = { for dataset in var.datasets : dataset["dataset_id"] => dataset }
tables = { for table in var.tables : table["table_id"] => table }

iam_to_primitive = {
"roles/bigquery.dataOwner" : "OWNER"
"roles/bigquery.dataEditor" : "WRITER"
"roles/bigquery.dataViewer" : "READER"
}
}

# Create the BigQuery datasets from the provided list
resource "google_bigquery_dataset" "bq_dataset" {
for_each = local.datasets
friendly_name = each.value["friendly_name"]
dataset_id = each.key
location = each.value["location"]
project = var.project_id
}

resource "google_bigquery_table" "bq_table" {
for_each = local.tables
dataset_id = each.value["dataset_id"]
friendly_name = each.key
table_id = each.key
labels = each.value["labels"]
schema = file(each.value["schema"])
clustering = each.value["clustering"]
expiration_time = each.value["expiration_time"]
project = var.project_id
deletion_protection = each.value["deletion_protection"]
depends_on = [google_bigquery_dataset.bq_dataset]

dynamic "time_partitioning" {
for_each = each.value["time_partitioning"] != null ? [each.value["time_partitioning"]] : []
content {
type = time_partitioning.value["type"]
expiration_ms = time_partitioning.value["expiration_ms"]
field = time_partitioning.value["field"]
require_partition_filter = time_partitioning.value["require_partition_filter"]
}
}

dynamic "range_partitioning" {
for_each = each.value["range_partitioning"] != null ? [each.value["range_partitioning"]] : []
content {
field = range_partitioning.value["field"]
range {
start = range_partitioning.value["range"].start
end = range_partitioning.value["range"].end
interval = range_partitioning.value["range"].interval
}
}
}
}
56 changes: 56 additions & 0 deletions tools/terraform_sync_tool/modules/bigquery/variables.tf
@@ -0,0 +1,56 @@
variable "location" {
description = "The regional location for the dataset only US and EU are allowed in module"
type = string
default = "US"
}

variable "deletion_protection" {
description = "Whether or not to allow Terraform to destroy the instance. Unless this field is set to false in Terraform state, a terraform destroy or terraform apply that would delete the instance will fail."
type = bool
default = true
}

variable "project_id" {
description = "Project where the dataset and table are created"
type = string
}

variable "datasets" {
description = "this is a test DS"
default = []
type = list(object({
dataset_id = string
friendly_name = string
location = string
}
))
}

variable "tables" {
description = "A list of objects which include table_id, schema, clustering, time_partitioning, expiration_time and labels."
default = []
type = list(object({
table_id = string,
dataset_id = string, # added to test creating multiple datasets
schema = string,
clustering = list(string),
deletion_protection = bool,
time_partitioning = object({
expiration_ms = string,
field = string,
type = string,
require_partition_filter = bool,
}),
range_partitioning = object({
field = string,
range = object({
start = string,
end = string,
interval = string,
}),
}),
expiration_time = string,
labels = map(string),
}
))
}
@@ -0,0 +1,62 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
},
{
"description": "Col5",
"mode": "NULLABLE",
"name": "Col5",
"type": "STRING"
},
{
"description": "Col6",
"mode": "NULLABLE",
"name": "Col6",
"type": "STRING"
},
{
"description": "Col7",
"mode": "NULLABLE",
"name": "Col7",
"type": "STRING"
},
{
"description": "Col8",
"mode": "NULLABLE",
"name": "Col8",
"type": "STRING"
},
{
"description": "Col9",
"mode": "NULLABLE",
"name": "Col9",
"type": "STRING"
},
{
"description": "Col10",
"mode": "NULLABLE",
"name": "Col10",
"type": "STRING"
}
]
@@ -0,0 +1,26 @@
[
{
"description": "Col1",
"mode": "NULLABLE",
"name": "Col1",
"type": "STRING"
},
{
"description": "Col2",
"mode": "NULLABLE",
"name": "Col2",
"type": "STRING"
},
{
"description": "Col3",
"mode": "NULLABLE",
"name": "Col3",
"type": "STRING"
},
{
"description": "Col4",
"mode": "NULLABLE",
"name": "Col4",
"type": "STRING"
}
]
51 changes: 51 additions & 0 deletions tools/terraform_sync_tool/qa/terraform-sync-tool/terragrunt.hcl
@@ -0,0 +1,51 @@
terraform {

source = "../../modules/bigquery"
}

include "root" {
path = find_in_parent_folders()
expose = true
}

locals {
# TODO: Update your dataset ID
dataset_id = "DatasetForTest" #YOUR_DATASET_ID
}

inputs = {
project_id = include.root.inputs.project_id
# The ID of the project in which the resource belongs. If it is not provided, the provider project is used.
datasets = [
{
dataset_id = "${local.dataset_id}"
friendly_name = "Dataset for Terraform Sync Tool"
location = "US"
labels = {}
}
]

tables = [
{
table_id = "TableForTest"
dataset_id = "${local.dataset_id}"
schema = "json_schemas/TableForTest.json"
clustering = []
expiration_time = null
deletion_protection = true
range_partitioning = null
time_partitioning = null
labels = {}
},
{
table_id = "TableForTest2"
dataset_id = "${local.dataset_id}"
schema = "json_schemas/TableForTest2.json"
clustering = []
expiration_time = null
deletion_protection = true
range_partitioning = null
time_partitioning = null
labels = {}
}
]
}