
PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue


ev2900/Glue_Aggregate_Small_Files


Glue Aggregate Small Parquet Files


When storing data in S3, it is important to consider the size of the files you store. Parquet files have an ideal size of 512 MB - 1 GB. Storing data in many small files can degrade the performance of data processing tools such as Spark.

This repository provides a PySpark script, Aggregate_Small_Parquet_Files.py, that consolidates small parquet files in an S3 prefix into larger parquet files.
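To illustrate the general approach, the sketch below shows one common way to aggregate small parquet files with PySpark: read everything under the prefix, reduce the number of partitions so each output file lands near the target size, and write the result back. This is a minimal sketch, not the repository's actual script; the function names, the `coalesce` strategy, and the 512 MB default are assumptions.

```python
import math


def target_file_count(total_prefix_size_mb, target_file_size_mb=512):
    # Aim for one output file per ~target_file_size_mb of input data,
    # never fewer than one file
    return max(1, math.ceil(total_prefix_size_mb / target_file_size_mb))


def aggregate_prefix(input_path, output_path, total_prefix_size_mb):
    # Hypothetical helper; requires a Glue/Spark runtime. The import is
    # deferred so target_file_count stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aggregate_small_parquet_files").getOrCreate()
    df = spark.read.parquet(input_path)  # reads all small files under the prefix
    n = target_file_count(total_prefix_size_mb)
    # coalesce(n) merges partitions without a full shuffle, so the job
    # writes roughly n larger parquet files instead of many small ones
    df.coalesce(n).write.mode("overwrite").parquet(output_path)
```

`coalesce` is used here because it avoids a shuffle; `repartition(n)` would also work and distributes data more evenly at the cost of a shuffle.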

How to run the Glue job to aggregate small parquet files

Note: if you are testing Aggregate_Small_Parquet_Files.py and need small parquet files as test data, you can follow the instructions in the Example folder to create them.

  1. Upload the Aggregate_Small_Parquet_Files.py file to an S3 bucket

  2. Run the CloudFormation stack below to create a Glue job that will aggregate the small parquet files

Launch CloudFormation Stack

As you follow the prompts to deploy the CloudFormation stack, ensure that you fill out the S3GlueScriptLocation parameter with the S3 URI of the Aggregate_Small_Parquet_Files.py file that you uploaded to an S3 bucket in the first step.


  3. Update and run the Glue job

The CloudFormation stack deployed a Glue job named Aggregate_Small_Parquet_Files. Navigate to the Glue console, select ETL jobs, and then select the Aggregate_Small_Parquet_Files job.

Update <s3_bucket_name> with the name of the S3 bucket containing the small files that need to be aggregated

Update <path_to_prefix> with the path to the prefix of a single partition with small files to aggregate in it

Optional: update the total_prefix_size to the desired target size of the aggregated parquet file(s)
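Filled in, the placeholders might look like the snippet below. The bucket name, prefix, and size value are hypothetical examples, not values from the repository; substitute your own.

```python
# Hypothetical example values -- replace with your own bucket and prefix
s3_bucket_name = "my-data-lake-bucket"          # replaces <s3_bucket_name>
path_to_prefix = "tables/events/dt=2023-01-01"  # replaces <path_to_prefix>
total_prefix_size = 1024                        # optional: target size in MB

# The job would then read the small files from a path like:
input_path = f"s3://{s3_bucket_name}/{path_to_prefix}/"
```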


After you update the S3 bucket name and the prefix path, save and run the Glue job. When the job finishes, the small parquet files in the specified S3 location will have been aggregated into larger files.
