* Ability to run Terraform with your AWS account. Terraform 0.12+ (you can download tfswitch to help with switching your Terraform version).
* A subnet within a VPC for the EMR cluster to run in.
* An [S3 Bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) for Segment to load data into. You can create a new one just for this, or re-use an existing one you already have.
## VPC
You'll need to provide a subnet within a VPC for the EMR cluster to run in. Here are some resources that can guide you through setting up a VPC for your EMR cluster:
The repository is split into multiple modules, and each can be used independently:
* [iam](/modules/iam) - IAM roles that give Segment access to your AWS resources.
* [emr](/modules/emr) - EMR cluster that Segment can submit jobs to load events into your Data Lake.
* [glue](/modules/glue) - Glue tables that Segment can write metadata to.
Terraform modules that create the AWS or Azure resources required for a Segment Data Lake.
# Usage
## Terraform Installation
*Note* - Skip this section if you already have a working Terraform setup
### OSX:
`brew` on OSX should install the latest version of Terraform.
```
brew install terraform
```
### CentOS/Ubuntu:
* Follow the instructions [here](https://phoenixnap.com/kb/how-to-install-terraform-centos-ubuntu) to install on CentOS/Ubuntu.
* Ensure that the installed version is > 0.11.x
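The version requirement above can be checked with a small shell sketch; the hard-coded version string here is just an example (in practice, parse the output of `terraform version`):

```shell
# Compare a Terraform version string against the 0.12 minimum.
# "0.12.31" is a placeholder; substitute your installed version.
version="0.12.31"
major=${version%%.*}      # text before the first dot
rest=${version#*.}        # text after the first dot
minor=${rest%%.*}         # text before the next dot
if [ "$major" -gt 0 ] || [ "$minor" -ge 12 ]; then
  echo "ok: Terraform $version meets the 0.12+ requirement"
else
  echo "too old: Terraform $version"
fi
```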
Verify the installation works by running:
```
terraform help
```
## Set up Project
* Create a project directory
```
mkdir segment-datalakes-tf
```
* Create a `main.tf` file
* Update the `external_ids` variable in the `locals` section to the workspace ID. This will allow all sources in the workspace to be synced to the Data Lake without any extra setup.
* **Note - Existing users** may be using the `sourceID` here instead of the `workspaceID`, which was the previous configuration. Setting this value as the `sourceID` is still supported for existing users for backwards compatibility. Follow instructions [here](#Using-workspaceID-as-the-externalID) to migrate to `workspaceID`. This will ensure you do not need to update this value for each source you want to add.
* Update the `bucket` in the `aws_s3_bucket` resource to the desired name of your S3 bucket
* Update the `subnet_id` in the `emr` module to the subnet in which to create the EMR cluster
```hcl
provider "aws" {
  # Replace this with the AWS region your infrastructure is set up in.
  region = "us-west-2"

  version = "~> 4"
}

locals {
  external_ids = {
    # Find these in the Segment UI. Only need to set this once for all sources in
    # the workspace
    # - Settings > General Settings
    <Workspace Name> = "<Workspace ID>"
  }

  # Add the names of the glue databases you want to set up lake-formation for.
  # glue_db_list = toset([<db_names>])
}

# This is the target where Segment will write your data.
# You can skip this if you already have an S3 bucket and just reference that name manually later.
# If you decide to skip this and use an existing bucket, ensure that you attach a 14 day expiration lifecycle policy to
# your S3 bucket for the "segment-stage/" prefix.
resource "aws_s3_bucket" "segment_datalake_s3" {
  bucket = "my-first-segment-datalake"
}

# Creates the IAM Policy that allows Segment to access the necessary resources
```
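If you re-use an existing bucket instead of creating one here, the 14 day expiration for the `segment-stage/` prefix could be sketched as follows. This is a hedged example, not part of the modules; the bucket name and resource labels are placeholders for your own:

```hcl
# Sketch only: expire objects under "segment-stage/" after 14 days on an
# existing bucket. "my-existing-bucket" is a placeholder.
resource "aws_s3_bucket_lifecycle_configuration" "segment_stage_expiry" {
  bucket = "my-existing-bucket"

  rule {
    id     = "expire-segment-stage"
    status = "Enabled"

    filter {
      prefix = "segment-stage/"
    }

    expiration {
      days = 14
    }
  }
}
```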
## The VPC/subnet configuration was invalid: No route to any external sources detected

```
* aws_emr_cluster.segment_data_lake_emr_cluster: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATED_WITH_ERRORS: VALIDATION_ERROR: The VPC/subnet configuration was invalid: No route to any external sources detected in Route Table for Subnet: subnet-{id} for VPC: vpc-{id}

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

exit status 1
```
The EMR cluster requires a route table attached to the subnet with an internet gateway. You can follow [this guide](https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/) for guidance on creating and attaching a route table and internet gateway.
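If you manage the VPC in the same Terraform configuration, the fix can be sketched as below. The resource labels are illustrative, and `aws_vpc.emr_vpc` / `aws_subnet.emr_subnet` are assumed to be defined elsewhere in your configuration:

```hcl
# Sketch: give the EMR subnet a route to the internet via an internet gateway.
resource "aws_internet_gateway" "emr_igw" {
  vpc_id = aws_vpc.emr_vpc.id
}

resource "aws_route_table" "emr_public" {
  vpc_id = aws_vpc.emr_vpc.id

  # Default route through the internet gateway.
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.emr_igw.id
  }
}

# Attach the route table to the subnet the EMR cluster runs in.
resource "aws_route_table_association" "emr_subnet" {
  subnet_id      = aws_subnet.emr_subnet.id
  route_table_id = aws_route_table.emr_public.id
}
```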
## The subnet configuration was invalid: The subnet subnet-{id} does not exist.
```
* aws_emr_cluster.segment_data_lake_emr_cluster: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATED_WITH_ERRORS: VALIDATION_ERROR: The subnet configuration was invalid: The subnet subnet-{id} does not exist.

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

exit status 1
```
The EMR cluster requires a subnet within a VPC. You can follow [this guide](https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/) to create one.
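Alternatively, a minimal VPC and subnet can be declared in Terraform directly. This is only a sketch; the CIDR blocks and resource labels are placeholders:

```hcl
# Sketch: a minimal VPC and subnet for the EMR cluster to run in.
resource "aws_vpc" "emr_vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "emr_subnet" {
  vpc_id     = aws_vpc.emr_vpc.id
  cidr_block = "10.0.1.0/24"
}
```

You would then pass `aws_subnet.emr_subnet.id` as the `subnet_id` of the `emr` module.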
205
-
206
-
If all else fails, tear down and start over.
If you want to create an Azure data lake, refer to the modules and setup guide in the [azure_datalake](azure_datalake) folder.
# FAQ
## Using workspaceID as the externalID
In order to support more versions of Terraform, the AWS Provider needs to be held at v2, as v3 has breaking changes we don't currently support. Our example `main.tf` has the code to accomplish this.
# Development
To develop in this repository, you'll want the following tools set up:
* [Terraform](https://www.terraform.io/downloads.html), >= 0.12 (note that 0.12 is used to develop this module, 0.11 is no longer supported)
To run unit tests, you also need an AWS account to be able to provision resources.
# Releasing
Releases are made from the master branch. First, make sure you have the latest code from master pulled locally:
```
git remote update
git checkout master
git reset origin/master --hard
```
Then, use [`git release`](https://github.com/tj/git-extras/blob/master/Commands.md#git-release) to cut a new version that follows [semver](https://semver.org):
```
git release x.y.z
```
Lastly, craft a new [GitHub release](https://github.com/segmentio/terraform-aws-data-lake/releases).
# License
Released under the [MIT License](https://opensource.org/licenses/MIT).