This repo is a proof-of-concept serverless data lake built on AWS. All the ETL jobs (Lambdas) are written entirely in Go, while the CI/CD pipeline is implemented in CDK TypeScript. Go was not used for the CDK component because, at the time of writing, the Go CDK bindings lacked the features needed for this POC to be viable.
This POC ingests a CSV file detailing Russian equipment losses in the current Russia-Ukraine conflict. When the file is uploaded to the Landing zone, it is converted to Parquet and written to the Curated zone; a second Lambda then picks it up and writes all rows to a DynamoDB table. All of this is just to test the ability and maturity of Go for cloud-native data wrangling; we are not doing any fancy or meaningful data science here.
Our test data set is included in the `data/` folder so you can replicate this data lake in your own environment.
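The second Lambda's core job is mapping each row onto a DynamoDB item. A minimal, dependency-free sketch of that mapping using DynamoDB's attribute-value shapes (`S` for strings, `N` for numbers) is below; the field names are assumptions, and real code would instead build the item with the AWS SDK's `attributevalue` marshaling and call `PutItem`.

```go
package main

import (
	"fmt"
	"strconv"
)

// toItem converts a parsed row into the DynamoDB attribute-value shape:
// {"S": ...} for strings, {"N": ...} for numbers. A real Lambda would
// produce the equivalent structure via aws-sdk-go-v2 and write it with
// PutItem or BatchWriteItem.
func toItem(date string, tanks int) map[string]map[string]string {
	return map[string]map[string]string{
		"date":  {"S": date},
		"tanks": {"N": strconv.Itoa(tanks)},
	}
}

func main() {
	item := toItem("2022-03-01", 150)
	fmt.Println(item["tanks"]["N"]) // prints: 150
}
```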
Huge shoutout to the community-driven open source packages that make projects like this viable:
To develop on this, simply start hacking. All Lambda source code is in the `src/` directory, with each subdirectory containing a different Lambda function. The stacks defined in the `lib/` directory are the core of the CDK application that actually creates and deploys resources to AWS.
First, authenticate to your AWS environment and make sure you have Docker running locally. Then just run:
cdk deploy
Amazing!
* `npm run build` - compile TypeScript to JS
* `npm run watch` - watch for changes and compile
* `npm run test` - perform the Jest unit tests
* `cdk deploy` - deploy this stack to your default AWS account/region
* `cdk diff` - compare deployed stack with current state
* `cdk synth` - emit the synthesized CloudFormation template