Commit 23635b0 (1 parent: 5653256)

Adding the Azure Datalake modules (#65)

* Adding the Azure modules and ReadMe files
* Refactoring the files in Azure and Aws folders

File tree: 36 files changed (+1005, −233 lines)


.gitignore (2 additions, 2 deletions)

```diff
@@ -39,8 +39,8 @@ Temporary Items
 /pkg/
 /spec/reports/
 /spec/examples.txt
-/test/tmp/
-/test/version_tmp/
+/aws_datalake/test/tmp/
+/aws_datalake/test/version_tmp/
 /tmp/
 
 # Used by dotenv library to load environment variables.
```

README.md (4 additions, 231 deletions)

````diff
@@ -1,209 +1,12 @@
-# terraform-aws-data-lake
+# terraform-data-lake
 
-Terraform modules which create AWS resources for a Segment Data Lake.
-
-# Prerequisites
-
-* Authorized [AWS account](https://aws.amazon.com/account/).
-* Ability to run Terraform with your AWS Account. Terraform 0.12+ (you can download tfswitch to help with switching your terraform version)
-* A subnet within a VPC for the EMR cluster to run in.
-* An [S3 Bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) for Segment to load data into. You can create a new one just for this, or re-use an existing one you already have.
-
-## VPC
-
-You'll need to provide a subnet within a VPC for the EMR to cluster to run in. Here are some resources that can guide you through setting up a VPC for your EMR cluster:
-* https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/
-* https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html
-* https://github.com/terraform-aws-modules/terraform-aws-vpc
-
-# Modules
-
-The repository is split into multiple modules, and each can be used independently:
-* [iam](/modules/iam) - IAM roles that give Segment access to your AWS resources.
-* [emr](/modules/emr) - EMR cluster that Segment can submit jobs to load events into your Data Lake.
-* [glue](/modules/glue) - Glue tables that Segment can write metadata to.
+Terraform modules which create AWS or Azure resources for a Segment Data Lake as required.
 
 # Usage
 
-## Terraform Installation
-*Note* - Skip this section if you already have a working Terraform setup
-### OSX:
-`brew` on OSX should install the latest version of Terraform.
-```
-brew install terraform
-```
-
-### Centos/Ubuntu:
-* Follow instructions [here](https://phoenixnap.com/kb/how-to-install-terraform-centos-ubuntu) to install on Centos/Ubuntu OS.
-* Ensure that the version installed in > 0.11.x
-
-Verify installation works by running:
-```
-terraform help
-```
-
-## Set up Project
-* Create project directory
-```
-mkdir segment-datalakes-tf
-```
-* Create `main.tf` file
-* Update the `external_ids` variable in the `locals` section to the workspace ID. This will allow all sources in the workspace to be synced to the Data Lake without any extra setup.
-* **Note - Existing users** may be using the `sourceID` here instead of the `workspaceID`, which was the previous configuration. Setting this value as the `sourceID` is still supported for existing users for backwards compatibility. Follow instructions [here](#Using-workspaceID-as-the-externalID) to migrate to `workspaceID`. This will ensure you do not need to update this value for each source you want to add.
-* Update the `name` in the `aws_s3_bucket` resource to the desired name of your S3 bucket
-* Update the `subnet_id` in the `emr` module to the subnet in which to create the EMR cluster
-
-```hcl
-provider "aws" {
-  # Replace this with the AWS region your infrastructure is set up in.
-  region = "us-west-2"
-
-  version = "~> 4"
-}
-
-locals {
-  external_ids = {
-    # Find these in the Segment UI. Only need to set this once for all sources in
-    # the workspace
-    # - Settings > General Settings
-    <Workspace Name> = "<Workspace ID>"
-  }
-
-  # Add the names of the glue databases you want to setup lake-formation for.
-
-  # glue_db_list = toset([<db_names>])
-
-}
-
-# This is the target where Segment will write your data.
-# You can skip this if you already have an S3 bucket and just reference that name manually later.
-# If you decide to skip this and use an existing bucket, ensure that you attach a 14 day expiration lifecycle policy to
-# your S3 bucket for the "segment-stage/" prefix.
-resource "aws_s3_bucket" "segment_datalake_s3" {
-  bucket = "my-first-segment-datalake"
-}
-
-# Creates the IAM Policy that allows Segment to access the necessary resources
-# in your AWS account for loading your data.
-module "iam" {
-  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/iam?ref=v0.6.0"
-
-  # Suffix is not strictly required if only initializing this module once.
-  # However, if you need to initialize multiple times across different Terraform
-  # workspaces, this hook allows the generated IAM policies to be given unique
-  # names.
-  suffix = "-prod"
-
-  s3_bucket    = aws_s3_bucket.segment_datalake_s3.id
-  external_ids = values(local.external_ids)
-}
-
-# Creates an EMR Cluster that Segment uses for performing the final ETL on your
-# data that lands in S3.
-module "emr" {
-  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/emr?ref=v0.6.0"
-
-  s3_bucket = aws_s3_bucket.segment_datalake_s3.id
-  subnet_id = "subnet-XXX" # Replace this with the subnet ID you want the EMR cluster to run in.
-
-  # LEAVE THIS AS-IS
-  iam_emr_autoscaling_role = module.iam.iam_emr_autoscaling_role
-  iam_emr_service_role     = module.iam.iam_emr_service_role
-  iam_emr_instance_profile = module.iam.iam_emr_instance_profile
-}
+If you want to create an AWS datalake refer to the modules and setup guide in [aws_datalake](aws_datalake) folder
 
-# Use the code below if you want to add lake-formation setup using terraform.
-# You need to use terraform >=0.13 to use the following setup
-# If you have teraform 0.12, you will have to run the setup separately for each database.
-# Add the names of glue databases to the glue_db_list variable in local.
-
-# module "lakeformation_datalake_permissions" {
-#   source   = "git@github.com:segmentio/terraform-aws-data-lake//modules/lakeformation?ref=v0.6.0"
-#   for_each = local.glue_db_list
-#   name     = each.key
-#   iam_roles = {
-#     datalake_role             = module.iam.segment_datalake_iam_role_arn,
-#     emr_instance_profile_role = module.iam.iam_emr_instance_profile
-#   }
-# }
-```
-
-## Provision Resources
-* Provide AWS credentials of the account being used. More details here: https://www.terraform.io/docs/providers/aws/index.html
-```
-export AWS_ACCESS_KEY_ID="anaccesskey"
-export AWS_SECRET_ACCESS_KEY="asecretkey"
-export AWS_DEFAULT_REGION="us-west-2"
-```
-* Initialize the references modules
-```
-terraform init
-```
-You should see a success message once you run the plan:
-```
-Terraform has been successfully initialized!
-```
-* Run plan
-This does not create any resources. It just outputs what will be created after you run apply(next step).
-```
-terraform plan
-```
-You should see something like towards the end of the plan:
-```
-Plan: 13 to add, 0 to change, 0 to destroy.
-```
-* Run apply - this step creates the resources in your AWS infrastructure
-```
-terraform apply
-```
-You should see:
-```
-Apply complete! Resources: 13 added, 0 changed, 0 destroyed.
-```
-
-Note that creating the EMR cluster can take a while (typically 5 minutes).
-
-Once applied, make a note of the following (you'll need to enter these as settings when configuring the Data Lake):
-* The **AWS Region** and **AWS Account ID** where your Data Lake was configured
-* The **Source ID and Slug** for _each_ Segment source that will be connected to the data lake
-* The generated **EMR Cluster ID**
-* The generated **IAM Role ARN**
-
-# Common Errors
-
-## The VPC/subnet configuration was invalid: No route to any external sources detected in Route Table for Subnet
-
-```
-Error: Error applying plan:
-1 error(s) occurred:
-* module.emr.aws_emr_cluster.segment_data_lake_emr_cluster: 1 error(s) occurred:
-* aws_emr_cluster.segment_data_lake_emr_cluster: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATED_WITH_ERRORS: VALIDATION_ERROR: The VPC/subnet configuration was invalid: No route to any external sources detected in Route Table for Subnet: subnet-{id} for VPC: vpc-{id}
-Terraform does not automatically rollback in the face of errors.
-Instead, your Terraform state file has been partially updated with
-any resources that successfully completed. Please address the error
-above and apply again to incrementally change your infrastructure.
-exit status 1
-```
-
-The EMR cluster requires a route table attached to the subnet with an internet gateway. You can follow [this guide](https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/) for guidance on creating and attaching a route table and internet gateway.
-
-## The subnet configuration was invalid: The subnet subnet-{id} does not exist.
-
-```
-Error: Error applying plan:
-1 error(s) occurred:
-* module.emr.aws_emr_cluster.segment_data_lake_emr_cluster: 1 error(s) occurred:
-* aws_emr_cluster.segment_data_lake_emr_cluster: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATED_WITH_ERRORS: VALIDATION_ERROR: The subnet configuration was invalid: The subnet subnet-{id} does not exist.
-Terraform does not automatically rollback in the face of errors.
-Instead, your Terraform state file has been partially updated with
-any resources that successfully completed. Please address the error
-above and apply again to incrementally change your infrastructure.
-exit status 1
-```
-
-The EMR cluster requires a subnet with a VPC. You can follow [this guide](https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/) to create a subnet.
-
-If all else fails, teardown and start over.
+If you want to create an Azure datalake refer to the modules and setup guide in [azure_datalake](azure_datalake) folder
 
 # FAQ
 ## Using workspaceID as the externalID
@@ -223,36 +26,6 @@ In order to support more versions of Terraform, the AWS Provider needs to held a
 as v3 has breaking changes we don't currently support. Our example `main.tf` has the
 code to accomplish this.
 
-# Development
-
-To develop in this repository, you'll want the following tools set up:
-
-* [Terraform](https://www.terraform.io/downloads.html), >= 0.12 (note that 0.12 is used to develop this module, 0.11 is no longer supported)
-* [terraform-docs](https://github.com/segmentio/terraform-docs)
-* [tflint](https://github.com/terraform-linters/tflint), 0.15.x - 0.26.x
-* [Ruby](https://www.ruby-lang.org/en/documentation/installation/), [>= 2.4.2](https://rvm.io)
-* [Bundler](https://bundler.io)
-
-To run unit tests, you also need an AWS account to be able to provision resources.
-
-# Releasing
-
-Releases are made from the master branch. First, make sure you have the last code from master pulled locally:
-
-```
-* git remote update
-* git checkout master
-* git reset origin/master --hard
-```
-
-Then, use [`git release`](https://github.com/tj/git-extras/blob/master/Commands.md#git-release) to cut a new version that follows [semver](https://semver.org):
-
-```
-git release x.y.z
-```
-
-Lastly, craft a new [Github release](https://github.com/segmentio/terraform-aws-data-lake/releases).
-
 # License
 
 Released under the [MIT License](https://opensource.org/licenses/MIT).
````
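For reference, the core of the worked `main.tf` example that this commit removes from the top-level README is reproduced below without diff markers, so it can be read and copied directly. It is taken verbatim from the deleted lines; the commit moves the AWS setup guide into the `aws_datalake` folder, so this snippet reflects the pre-commit module paths, and the `<Workspace Name>`/`<Workspace ID>` pair is a placeholder to be filled in from the Segment UI.

```hcl
provider "aws" {
  # Replace this with the AWS region your infrastructure is set up in.
  region = "us-west-2"

  version = "~> 4"
}

locals {
  external_ids = {
    # Find these in the Segment UI under Settings > General Settings.
    # Placeholder: replace with your workspace name and ID.
    <Workspace Name> = "<Workspace ID>"
  }
}

# The target where Segment will write your data.
resource "aws_s3_bucket" "segment_datalake_s3" {
  bucket = "my-first-segment-datalake"
}

# IAM policy that allows Segment to access the necessary resources.
module "iam" {
  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/iam?ref=v0.6.0"

  suffix       = "-prod"
  s3_bucket    = aws_s3_bucket.segment_datalake_s3.id
  external_ids = values(local.external_ids)
}

# EMR cluster that Segment uses for the final ETL on data landing in S3.
module "emr" {
  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/emr?ref=v0.6.0"

  s3_bucket = aws_s3_bucket.segment_datalake_s3.id
  subnet_id = "subnet-XXX" # Replace with the subnet for the EMR cluster.

  # LEAVE THIS AS-IS
  iam_emr_autoscaling_role = module.iam.iam_emr_autoscaling_role
  iam_emr_service_role     = module.iam.iam_emr_service_role
  iam_emr_instance_profile = module.iam.iam_emr_instance_profile
}
```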
