AWS Glue Data Catalog Replication Utility

This utility replicates the AWS Glue Data Catalog from one AWS account to another. You can use it to replicate databases, tables, and partitions from one source AWS account to one or more target AWS accounts. It uses the AWS Glue APIs via the AWS SDK for Java, together with serverless technologies such as AWS Lambda, Amazon SQS, and Amazon SNS. The architecture of this utility is shown in the following diagram.

Automated Deployment

Follow the instructions in this README.md to deploy this utility through CloudFormation in your AWS accounts. Otherwise, follow the guide below for a manual deployment.

Build Instructions

The source code is a Maven project. You can build it using standard Maven commands, e.g. mvn -X clean install, or use the build options available in your IDE.
This step generates a Jar file, e.g. aws-glue-data-catalog-replication-utility-1.0.0.jar

AWS Service Requirements

This utility requires the following AWS services

Source Account

3 AWS Lambda functions
3 Amazon DynamoDB tables
2 Amazon SNS Topics
1 Amazon SQS Queue
1 Amazon S3 Bucket

Each Target Account

3 AWS Lambda functions
2 Amazon DynamoDB tables
2 Amazon SQS Queues

Lambda Functions Overview

Class	Purpose
GDCReplicationPlannerLambda	Lambda function that determines the list of databases to export. It is the driver program that initiates the replication process.
ExportLambda	Lambda function to export databases and tables.
ExportLargeTableLambda	Lambda function to export large tables, i.e. tables with more than 10 partitions.
ImportLambda	Lambda function to import databases and tables.
ImportLargeTableLambda	Lambda function to import large tables.
DLQProcessorLambda	Lambda function used to process errors generated by ImportLambda.
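
The split between ExportLambda and ExportLargeTableLambda can be pictured as a simple routing decision. The following is a minimal sketch, not the utility's actual code. The class TableRouter and the Route enum are hypothetical names, and the routing itself (large tables to the SQS queue, everything else to the SNS topic) is an assumption inferred from the environment variables described later in this README. Only the threshold of 10 partitions comes from the table above.

```java
// Hypothetical illustration; names are not taken from the utility's source code.
final class TableRouter {

    // From this README: a table with more than 10 partitions is treated as large.
    static final int LARGE_TABLE_PARTITION_THRESHOLD = 10;

    enum Route { SCHEMA_DISTRIBUTION_TOPIC, LARGE_TABLE_QUEUE }

    // Assumption: large tables go to the SQS queue, all other tables to the SNS topic.
    static Route routeFor(int partitionCount) {
        if (partitionCount < 0) {
            throw new IllegalArgumentException("partitionCount must not be negative");
        }
        return partitionCount > LARGE_TABLE_PARTITION_THRESHOLD
                ? Route.LARGE_TABLE_QUEUE
                : Route.SCHEMA_DISTRIBUTION_TOPIC;
    }
}
```

Under these assumptions, a table with exactly 10 partitions is still handled as a regular table, and a table with 11 partitions is handled as a large table.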

Deployment Instructions - Source Account

Create DynamoDB tables as defined in the following table

Table	Purpose	Schema	Capacity
glue_database_export_task	audit data for replication planner	Partition key - db_id (String), Sort key - export_run_id (Number)	On-Demand
db_status	audit data for databases exported	Partition key - db_id (String), Sort key - export_run_id (Number)	On-Demand
table_status	audit data for tables exported	Partition key - table_id (String), Sort key - export_run_id (Number)	On-Demand

Create two SNS Topics
1. Topic 1: Name = e.g. ReplicationPlannerSNSTopic
2. Topic 2: Name = e.g. SchemaDistributionSNSTopic
Create an S3 Bucket. It is used to save partitions for large tables (more than 10 partitions). This bucket must provide cross-account permissions to the IAM roles used by the ImportLargeTable Lambda function in the Target Account. Refer to the following AWS resources for more details:
1. https://aws.amazon.com/premiumsupport/knowledge-center/cross-account-access-s3/
2. https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
Create one SQS Queue
1. Queue Name = e.g. LargeTableSQSQueue
2. Queue Type = Standard
3. Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: It must be higher than the execution timeout of the ExportLargeTable Lambda function.
Create a Lambda Execution IAM Role and attach it to the Lambda functions deployed in the Source Account. This role needs multiple permissions. Refer to the following IAM policies for the required permissions:
1. You can use AWS managed policy called AWSLambdaExecute (Policy ARN # arn:aws:iam::aws:policy/AWSLambdaExecute)
2. sample_sqs_policy_source_and_target_accounts
3. sample_sns_policy_source_account
4. sample_glue_policy_source_account
5. sample_ddb_policy_source_and_target_accounts

Deploy GDCReplicationPlannerLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.GDCReplicationPlanner
Timeout = e.g. 5 minutes
Memory = e.g. 128 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
source_glue_catalog_id	Source AWS Account Id
ddb_name_gdc_replication_planner	Name of the DDB Table for glue_database_export_task of source account
database_prefix_list	List of database prefixes separated by a token. E.g. raw_data_,processed_data_. To export all databases, do not add this variable.
separator	The separator used in the database_prefix_list. E.g. ,. This can be skipped when database_prefix_list is not added.
region	e.g. us-east-1
sns_topic_arn_gdc_replication_planner	SNS Topic ARN for ReplicationPlannerSNSTopic
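
To illustrate how database_prefix_list and separator might work together, the following is a minimal sketch rather than the utility's actual implementation. The class PrefixFilter and its method names are hypothetical. It assumes that a database is selected when its name starts with any listed prefix, and that a missing list means all databases are exported, as the table above describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the utility's source code.
final class PrefixFilter {

    // Splits the prefix list on a literal separator.
    // A null or blank list yields no prefixes.
    static List<String> parsePrefixes(String prefixList, String separator) {
        List<String> prefixes = new ArrayList<>();
        if (prefixList == null || prefixList.trim().isEmpty()) {
            return prefixes;
        }
        // Pattern.quote treats the separator literally, so "|" or "." also work.
        for (String token : prefixList.split(Pattern.quote(separator))) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                prefixes.add(trimmed);
            }
        }
        return prefixes;
    }

    // Assumption: with no prefixes configured, every database is exported.
    static boolean shouldExport(String databaseName, List<String> prefixes) {
        if (prefixes.isEmpty()) {
            return true;
        }
        for (String prefix : prefixes) {
            if (databaseName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With database_prefix_list set to raw_data_,processed_data_ and separator set to a comma, this sketch would select a database named raw_data_sales and skip one named tmp_db.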

Deploy ExportLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportDatabaseWithTables
Timeout = e.g. 5 minutes
Memory = e.g. 192 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
source_glue_catalog_id	Source AWS Account Id
ddb_name_db_export_status	Name of the DDB Table for db_status of source account
ddb_name_table_export_status	Name of the DDB Table for table_status of source account
region	e.g. us-east-1
sns_topic_arn_export_dbs_tables	SNS Topic ARN for SchemaDistributionSNSTopic
sqs_queue_url_large_tables	SQS Queue URL for LargeTableSQSQueue

Add ReplicationPlannerSNSTopic as a trigger to ExportLambda function

Deploy ExportLargeTableLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportLargeTable
Timeout = e.g. 3 minutes
Memory = e.g. 256 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
s3_bucket_name	Name of the S3 Bucket used to save partitions for large Tables
ddb_name_table_export_status	Name of the DDB Table for table_status of source account
region	e.g. us-east-1
sns_topic_arn_export_dbs_tables	SNS Topic ARN for SchemaDistributionSNSTopic

Add LargeTableSQSQueue as a trigger to ExportLargeTableLambda function
1. Batch size = 1
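
The visibility timeout guidance for LargeTableSQSQueue comes down to a simple comparison against the Lambda function timeout. The sketch below shows that check with the example values from this README. The class TimeoutCheck and its method are hypothetical and exist only for illustration.

```java
import java.time.Duration;

// Hypothetical helper for sanity-checking deployment settings.
final class TimeoutCheck {

    // True when the queue's visibility timeout is strictly longer
    // than the timeout of the Lambda function that consumes the queue.
    static boolean isVisibilityTimeoutSafe(Duration visibilityTimeout, Duration lambdaTimeout) {
        return visibilityTimeout.compareTo(lambdaTimeout) > 0;
    }
}
```

With the example values used here, a visibility timeout of 3 minutes 15 seconds (195 seconds) against a function timeout of 3 minutes (180 seconds), the check passes. Equal values would fail it.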

Set up cross-account permissions in the Source Account. Grant the Target Account permission to subscribe to the second SNS Topic (SchemaDistributionSNSTopic):

aws sns add-permission --label lambda-access --aws-account-id TargetAccount \
--topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
--action-name Subscribe ListSubscriptionsByTopic Receive

Deployment Instructions - Target Account

Create DynamoDB tables as defined in the following table

Table	Purpose	Schema	Capacity
db_status	audit data for databases imported	Partition key - db_id (String), Sort key - import_run_id (Number)	On-Demand
table_status	audit data for tables imported	Partition key - table_id (String), Sort key - import_run_id (Number)	On-Demand

Create SQS Queue
1. Queue Name = LargeTableSQSQueue
2. Queue Type = Standard
3. Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: It must be higher than the execution timeout of the ImportLargeTable Lambda function.
Create SQS Queue - dead letter queue processing
1. Queue Name = DeadLetterQueue
2. Queue Type = Standard
3. Default Visibility Timeout = e.g. 3 minutes 15 seconds
Create a Lambda Execution IAM Role and attach it to the Lambda functions deployed in the Target Account. This role needs multiple permissions. Refer to the following IAM policies for the required permissions:
1. You can use AWS managed policy called AWSLambdaExecute (Policy ARN # arn:aws:iam::aws:policy/AWSLambdaExecute)
2. sample_sqs_policy_source_and_target_accounts
3. sample_glue_policy_target_account
4. sample_ddb_policy_source_and_target_accounts

Deploy ImportLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportDatabaseOrTable
Timeout = e.g. 5 minutes
Memory = e.g. 192 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
target_glue_catalog_id	Target AWS Account Id
ddb_name_db_import_status	Name of the DDB Table for db_status of target account
ddb_name_table_import_status	Name of the DDB Table for table_status of target account
skip_archive	true
region	e.g. us-east-1
sqs_queue_url_large_tables	SQS Queue URL for LargeTableSQSQueue
dlq_url_sqs	SQS Queue URL for DeadLetterQueue

Give SchemaDistributionSNSTopic permission to invoke the ImportLambda function

aws lambda add-permission --function-name ImportLambda \
--source-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
--statement-id sns-x-account --action "lambda:InvokeFunction" \
--principal sns.amazonaws.com

Subscribe ImportLambda function to SchemaDistributionSNSTopic

aws sns subscribe --protocol lambda \
--topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
--notification-endpoint arn:aws:lambda:us-east-1:TargetAccount:function:ImportLambda

Additional References:

https://docs.aws.amazon.com/lambda/latest/dg/with-sns-example.html#with-sns-create-x-account-permissions

Deploy ImportLargeTableLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportLargeTable
Timeout = e.g. 3 minutes
Memory = e.g. 256 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
target_glue_catalog_id	Target AWS Account Id
ddb_name_table_import_status	Name of the DDB Table for table_status of target account
skip_archive	true
region	e.g. us-east-1

Add LargeTableSQSQueue as a trigger to ImportLargeTableLambda function
1. Batch size = 1

Deploy DLQProcessorLambda function

Runtime = Java 8
Function package = Use the generated Jar file. Refer to the Build Instructions section.
Lambda Handler = com.amazonaws.gdcreplication.lambda.DLQImportDatabaseOrTable
Timeout = e.g. 3 minutes
Memory = e.g. 192 MB
Environment variable = as defined in the following table

Variable Name	Variable Value
target_glue_catalog_id	Target AWS Account Id
ddb_name_db_import_status	Name of the DDB Table for db_status of target account
ddb_name_table_import_status	Name of the DDB Table for table_status of target account
skip_archive	true
dlq_url_sqs	SQS Queue URL for DeadLetterQueue
region	e.g. us-east-1

Add DeadLetterQueue as a trigger to DLQProcessorLambda function
1. Batch size = 1

Advantages

This solution was designed around three main tenets: simplicity, scalability, and cost-effectiveness. The following are direct benefits:

Target AWS accounts are independent, allowing the solution to scale efficiently.
The target accounts always see the latest table information.
Lightweight and dependable at scale.
The implementation is fully customizable.

Limitations

The following are the primary limitations:

This utility is NOT intended for real-time replication. Refer to the section Use Case 2: Ongoing replication to learn how to run the replication process as a scheduled job.
This utility is NOT intended for two-way replication between AWS Accounts.
This utility does NOT attempt to resolve database and table name conflicts, which may result in undesirable behavior.

Applicable Use Cases

Use Case 1: One-time replication

To do this, you can run the GDCReplicationPlannerLambda function using a Test event in the AWS Lambda console.

Use Case 2: Ongoing replication

To do this, you can create a CloudWatch Event Rule in the Source Account and add GDCReplicationPlannerLambda as its target. Refer to the AWS documentation on CloudWatch Events rules for more details.

Replication Mechanism in Target Account

For databases and tables, the actions taken by the import Lambda functions depend on the state of the Glue Data Catalog in the target account. Those actions are summarized in the following table.

Input Message Type	State of Target Glue Data Catalog	Action Taken in Target Glue Data Catalog
Database	Database already exists	Skip the message
Database	Database does not exist	Create Database
Table	Table already exists	Update Table
Table	Table does not exist	Create Table
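
The table above can be read as a small decision function. The following is a minimal sketch of that mapping only. The class ImportActions and its enums are hypothetical names chosen for illustration and are not taken from the utility's source code; the real Lambda functions call the AWS Glue APIs, which this sketch omits.

```java
// Hypothetical illustration of the decision table; it performs no Glue calls.
final class ImportActions {

    enum MessageType { DATABASE, TABLE }

    enum Action { SKIP_MESSAGE, CREATE_DATABASE, UPDATE_TABLE, CREATE_TABLE }

    // Maps the message type and the state of the target catalog to an action.
    static Action actionFor(MessageType type, boolean existsInTarget) {
        switch (type) {
            case DATABASE:
                return existsInTarget ? Action.SKIP_MESSAGE : Action.CREATE_DATABASE;
            case TABLE:
                return existsInTarget ? Action.UPDATE_TABLE : Action.CREATE_TABLE;
            default:
                throw new IllegalArgumentException("Unknown message type: " + type);
        }
    }
}
```

Note the asymmetry the table describes: an existing database is left untouched, while an existing table is updated.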

For partitions, the actions are summarized in the following table:

Partitions in Export	State in Target Glue Data Catalog	Action Taken in Target Account
Partitions DO NOT exist	Target Table has no partitions	No action taken
Partitions DO NOT exist	Target Table has partitions	Delete current partitions
Partitions exist	Target Table has no partitions	Create new partitions
Partitions exist	Target Table has partitions	Delete current partitions, create new partitions
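
The four partition cases reduce to two independent rules: delete whatever the target table currently has, then create whatever the export contains. The sketch below expresses that reading of the table. The class PartitionActions and its enum are hypothetical names used for illustration, and the sketch makes no Glue API calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the partition decision table.
final class PartitionActions {

    enum Action { DELETE_CURRENT_PARTITIONS, CREATE_NEW_PARTITIONS }

    // Returns the ordered actions for a table; an empty list means no action is taken.
    static List<Action> actionsFor(boolean exportHasPartitions, boolean targetHasPartitions) {
        List<Action> actions = new ArrayList<>();
        if (targetHasPartitions) {
            actions.add(Action.DELETE_CURRENT_PARTITIONS);
        }
        if (exportHasPartitions) {
            actions.add(Action.CREATE_NEW_PARTITIONS);
        }
        return actions;
    }
}
```

Read this way, the target table's partitions are always replaced by the exported set rather than merged with it.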

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.
