This architecture shows how to run Amazon Athena SQL queries from AWS Lambda using the Boto3 API. It can be fully deployed using AWS CDK and is designed to fit into a larger serverless architecture. The CDK stack configures and deploys a Lambda function with the IAM permissions it needs to run Athena SQL queries against data in an S3 bucket. Query results are written to an S3 output location specified by the user. This architecture is useful when Athena queries need to run on a regular schedule.
- An active AWS account
- An Amazon Simple Storage Service (Amazon S3) bucket with pre-existing data
- Data to be queried by Athena should be available in an S3 bucket.
- Amazon S3 data is cataloged via AWS Glue
- This can be done using a Glue crawler. For more information regarding this, refer to Using AWS Glue to connect to data sources in Amazon S3 from the Amazon Athena documentation.
- Default output S3 bucket for Amazon Athena has been set
- Before running any queries in Athena, an output S3 bucket location in the same region must be set in Athena settings. For more information regarding this, refer to Specifying a query result location from the Amazon Athena documentation.
- Amazon Athena workgroup
- If you do not have an existing Athena workgroup to use for querying, follow Setting up workgroups from the Amazon Athena documentation. We recommend using a workgroup that only has access to the tables used in the query.
- Familiarity with deploying AWS resources using AWS CDK.
- For more information regarding this, refer to the AWS CDK Workshop.
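Some of the prerequisites above can be checked from a script before deploying. The following is a minimal sketch, not part of this repo: the function name `check_prerequisites` and its parameters are hypothetical, and it assumes the query result location is configured on the workgroup rather than only in the console settings. It uses the Boto3 `glue.get_table` and `athena.get_work_group` calls.

```python
def check_prerequisites(database, table, workgroup, glue=None, athena=None):
    """Return a list of problems found; an empty list means both checks passed."""
    if glue is None or athena is None:
        import boto3  # imported lazily so the function also works with stub clients
        glue = glue or boto3.client("glue")
        athena = athena or boto3.client("athena")

    problems = []

    # The table must exist in the Glue Data Catalog for Athena to query it.
    try:
        glue.get_table(DatabaseName=database, Name=table)
    except Exception as err:  # for example, an EntityNotFoundException
        problems.append(f"Glue table {database}.{table} not found: {err}")

    # Assumption: the result location is set in the workgroup configuration.
    try:
        group = athena.get_work_group(WorkGroup=workgroup)["WorkGroup"]
        result_config = group.get("Configuration", {}).get("ResultConfiguration", {})
        if not result_config.get("OutputLocation"):
            problems.append(f"Workgroup {workgroup} has no query result location")
    except Exception as err:
        problems.append(f"Workgroup {workgroup} not available: {err}")

    return problems
```

Calling `check_prerequisites("my_database", "my_table", "my_workgroup")` with your own names returns an empty list when both checks pass. The names in that call are placeholders.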
- S3 bucket (prerequisite) — contains data to be queried
- Lambda function — executes Athena SQL queries via Boto3 API
- IAM role for Lambda function — Lambda execution role with the permissions needed to query S3 data via Athena and save results to the specified S3 location. This role contains an access policy that follows the principle of least privilege
- Note: For simplicity, the input and output buckets are configured to be the same in this pattern. However, the user can optionally specify separate input and output buckets in the CDK code.
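To make the Lambda component above concrete, here is a minimal sketch of a handler that starts an Athena query through Boto3 and waits for it to finish. It is an illustration under stated assumptions, not the code shipped in this repo: the helper `run_query` and the environment variable names (`ATHENA_QUERY`, `ATHENA_DATABASE`, `ATHENA_OUTPUT_LOCATION`, `ATHENA_WORKGROUP`) are hypothetical.

```python
import os
import time


def run_query(athena, query, database, output_location, workgroup="primary",
              poll_seconds=1.0, max_polls=60):
    """Start an Athena query, poll until a terminal state, return (query_id, state)."""
    started = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
        WorkGroup=workgroup,
    )
    query_id = started["QueryExecutionId"]

    state = "QUEUED"
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    # If max_polls is exhausted, the last observed (non-terminal) state is returned.
    return query_id, state


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    # Hypothetical environment variable names; use the ones set by your stack.
    query_id, state = run_query(
        boto3.client("athena"),
        query=os.environ["ATHENA_QUERY"],
        database=os.environ["ATHENA_DATABASE"],
        output_location=os.environ["ATHENA_OUTPUT_LOCATION"],
        workgroup=os.environ.get("ATHENA_WORKGROUP", "primary"),
    )
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    return {"QueryExecutionId": query_id, "State": state}
```

Polling inside the handler keeps the example short, but the function timeout then has to cover the query's run time. For long queries, returning the `QueryExecutionId` immediately and checking the status elsewhere may be a better fit.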
AWS Lambda can be run on demand or configured to run on a schedule using Amazon EventBridge (formerly CloudWatch Events).
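As one way to set up the scheduled option, the sketch below creates a schedule rule in EventBridge (the successor to CloudWatch Events) and points it at the Lambda function using Boto3. This is an assumption-laden illustration: the function name `schedule_lambda`, the target ID, and the statement ID are hypothetical, and in this pattern you may prefer to define the schedule in the CDK stack instead.

```python
def schedule_lambda(function_arn, rule_name, schedule_expression,
                    events=None, lambda_client=None):
    """Create a schedule rule that invokes the given Lambda function; return the rule ARN."""
    if events is None or lambda_client is None:
        import boto3  # imported lazily so the function also works with stub clients
        events = events or boto3.client("events")
        lambda_client = lambda_client or boto3.client("lambda")

    # Create (or update) the rule, e.g. schedule_expression="rate(1 day)".
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function for this rule only.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Attach the function as the rule's target.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "athena-query-lambda", "Arn": function_arn}],
    )
    return rule_arn
```

Note that `add_permission` fails if a statement with the same ID already exists, so rerunning this sketch against the same rule needs a different statement ID or a prior cleanup.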
Clone this repo and configure the //TODO portions of the code found in the lib/athena-queries-via-lambda-stack.ts file with the proper values from your AWS environment.
The cdk.json file tells the CDK Toolkit how to execute your app.
Before deploying, install the dependencies by running the following in the root folder of your code files:
npm install -g aws-cdk
npm install
npm run build
Note: The above commands should be run within the root folder containing the cdk.json file.
This stack uses assets, so the toolkit stack must be deployed to the environment. This can be done by running the following command:
cdk bootstrap aws://your-aws-account-id/your-specified-aws-region
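If you are unsure which account ID to put in the bootstrap command above, it can be looked up from your current credentials. The following is a small sketch, assuming Boto3 is installed locally and credentials are configured; the function name `bootstrap_target` is hypothetical.

```python
def bootstrap_target(sts=None, region=None):
    """Build the aws://ACCOUNT/REGION string expected by `cdk bootstrap`."""
    if sts is None:
        import boto3  # imported lazily so the function also works with a stub client
        session = boto3.Session()
        sts = session.client("sts")
        region = region or session.region_name

    if not region:
        raise ValueError("No region given and none configured in the session")

    # get_caller_identity returns the account that owns the current credentials.
    account_id = sts.get_caller_identity()["Account"]
    return f"aws://{account_id}/{region}"
```

Running `print(bootstrap_target())` prints the environment string for the account and default region of your active profile, which you can pass to `cdk bootstrap`.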
At this point, you can now synthesize the CloudFormation template for this code by running the following command:
cdk synth
Finally, to deploy the stack to your AWS environment, run the following command:
cdk deploy
Navigate to the AWS Lambda console and look for the function created by the CDK stack. It should be named something like CdkStack-queryAthena followed by a series of numbers and letters.
Click on the Lambda function and open the “Configuration” tab. Next, click on “Environment Variables”. The environment variables should match what you filled out in the //TODO sections in the CDK code.
If no test event exists for the Lambda function, create a new test event (fine to use the default, pre-populated JSON event). Click on “Test” and ensure the Lambda function executes successfully.
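The same test can be run outside the console. The sketch below invokes the function synchronously with Boto3 and raises if the function reports an error. It is an illustration only: the helper name `invoke_function` is hypothetical, and you would pass the actual function name shown in the Lambda console.

```python
import json


def invoke_function(function_name, payload=None, lambda_client=None):
    """Invoke a Lambda function synchronously and return its decoded JSON result."""
    if lambda_client is None:
        import boto3  # imported lazily so the function also works with a stub client
        lambda_client = boto3.client("lambda")

    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload or {}).encode("utf-8"),
    )
    body = response["Payload"].read()

    # "FunctionError" is only present when the function raised or timed out.
    if "FunctionError" in response:
        raise RuntimeError(f"{function_name} failed: {body.decode('utf-8')}")
    return json.loads(body) if body else None
```

An empty payload plays the same role as the default test event in the console, since the query parameters come from the environment variables rather than from the event.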
Next, navigate to the S3 bucket specified as the output location for Athena query results. Check that files have been saved to the specified output folder. You can also download the output file locally to verify the results of the Athena SQL query.
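The output check can also be scripted. The following sketch lists the result files under the output prefix using Boto3. The function name `list_query_outputs` is hypothetical, and the `.csv` filter assumes the query is a SELECT statement, for which Athena writes a CSV result file alongside a `.metadata` file.

```python
def list_query_outputs(bucket, prefix, s3=None):
    """Return the keys of Athena CSV result files under the given S3 prefix."""
    if s3 is None:
        import boto3  # imported lazily so the function also works with a stub client
        s3 = boto3.client("s3")

    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        # Continue from where the previous page stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

    # Keep only the result files, skipping the accompanying .metadata objects.
    return [key for key in keys if key.endswith(".csv")]
```

An empty list after a successful Lambda run usually means the bucket or prefix passed here does not match the output location the function was configured with.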
Navigate to the root folder of the code files and run the following:
cdk destroy
This will destroy all the cloud infrastructure deployed by the CDK stack.
- Amazon Simple Storage Service (Amazon S3) — used for data storage
- AWS Lambda — serverless compute service, makes the API call to Athena
- AWS Cloud Development Kit (CDK) — software development framework used to provision cloud resources
- Amazon Athena (indirectly) — serverless, interactive analytics service, executes SQL queries on data in S3
- AWS Glue (prerequisite) — data catalog of available data, contains metadata for tables queried by Athena
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
- Siddharth Kumaran -- Assoc. Machine Learning Engineer @ AWS Professional Services
- Ritika Raju -- Assoc. Cloud App Developer @ AWS Professional Services
- Isabelle Imacseng -- Data & ML Engineer @ AWS Professional Services
- Radhika Tallamraju -- Data & ML Engineer @ AWS Professional Services