tomreitz (Collaborator) commented Feb 20, 2025

This DRAFT PR is not yet ready for merge; however, I want to explain what the work represents.

earthmover currently uses Dask (as opposed to Pandas or other DataFrame libraries) for essentially one reason: the ability to handle larger-than-memory data by spilling to disk. However, earthmover's use of Dask is single-threaded: if you run earthmover on a large machine with (say) 32 cores, only one core will be used to do transformation.

Some transformation workloads you might want to run with earthmover are memory-bound: such workloads would benefit little, if at all, from parallelization. However, some transformation workloads are compute-bound: such workloads could benefit from parallelization, possibly a lot.

This branch uses Dask's LocalCluster to parallelize workloads across a configurable number of concurrent workers (assigned to different cores on the machine). Initial tests on my laptop, with 16 CPU cores and 8GB memory, seem to show that this can make some workloads run faster.

In example_projects/01_simple/ we have a big_earthmover.yaml which does a simple transformation on a 1B-row, 3.2GB TSV file and produces a 1B-line 27.9GB JSONL file.

  • Under the most-recent released version of earthmover, this runs in 71 minutes.
  • Under this branch, with 4 cores of 1.2 GB memory each, it ran in 49 minutes (a 30% improvement).

To get this to work, several things had to be refactored in earthmover, especially because Dask parallelizes by serializing/pickling data and objects when it ships them to workers on another core. Some of the specific changes that were needed:

  • use Python partials instead of lambda expressions (which are not serializable)
  • move Jinja template compilation into each partition (the method called by .map_partitions()) to deal with the fact that compiled Jinja templates are not serializable
  • read input files with a smaller-than-default blocksize, since they get wider (and larger) when turned into JSONL in a destination
  • write output files with df.to_csv(self.file, single_file=True, ...) to avoid each thread blocking on I/O
  • remove extra **kwargs from several methods, which prevented object serialization
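For instance, the lambda-to-partial change looks roughly like this. (A pure-Python sketch: a dict of column lists stands in for a pandas partition, and map_column and the column names are illustrative, not earthmover's actual code. Dask pickles tasks before shipping them to workers, and a module-level function bound with functools.partial survives pickling where a lambda caused serialization errors in this PR.)

```python
from functools import partial

def map_column(partition, column, mapping):
    # Per-partition worker function: a plain, named function (picklable),
    # in the shape .map_partitions() expects after binding static args.
    return {**partition, column: [mapping.get(v, v) for v in partition[column]]}

# Bind the static arguments up front, leaving only the partition argument:
task = partial(map_column, column="grade_level", mapping={"K": "KG"})

result = task({"grade_level": ["K", "1", "2"]})
```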

See the further comments below for performance analysis of earthmover distributed, and for details about how earthmover works with Dask.

Further work to wrap up here includes:

  • fixing the Package class and deps-building process (currently commented out), which still throws Dask serialization errors
  • documenting and determining reasonable defaults for the earthmover.yml config.dask and dask_cluster_kwargs settings (possibly dynamic, based on host machine specs)
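For discussion purposes, that configuration might look something like the sketch below. (Hypothetical: only config.dask and dask_cluster_kwargs are named in this PR; the nesting and defaults are illustrative, and the kwargs shown are real dask.distributed.LocalCluster parameters.)

```yaml
# earthmover.yml (hypothetical sketch; the real schema is still to be determined)
config:
  dask:
    # passed through to dask.distributed.LocalCluster
    dask_cluster_kwargs:
      n_workers: 4
      threads_per_worker: 1
      memory_limit: "1.2GB"
```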

tomreitz (Collaborator) commented Mar 12, 2025

Since writing the PR description above, I've been testing this branch with various other workloads, which has helped shed light on issues with other operations. The big_earthmover.yml project, while a good test of a certain kind of scaling, is an incredibly simple transformation: it doesn't change the grain of the data at all, it simply

  • maps values of one column
  • renames one column
  • adds one column

When I started testing with other transformations, I found and had to fix issues with join and group_by, among others. In this comment, I will share some performance data for various workloads I've tested.

First, an update on big_earthmover.yml (the 1B-row, 3.2GB TSV input file that produces a 1B-line 27.9GB JSONL output file). In addition to the previous two data points:

  • Under the most-recent released version of earthmover, this runs in 71 minutes.
  • Under this branch, with 4 cores of 1.2 GB memory each, it ran in 49 minutes (a 30% improvement).

I also ran with two other configurations:

  • Under this branch, with 2 cores with 2.4 GB memory each, it ran in 42 minutes (41% improvement over current main).
  • Under this branch, with 6 cores with 800 MB memory each, it ran in 51 minutes (28% improvement over current main).

Now, turning to other workloads: I used the TPC-H datasets and queries I wrote about here. Below I briefly explain each query/workload, and then give the performance figures. At the end, I'll discuss some takeaways and next steps. Finally, in a separate comment, I'll give some detail about how various operations work under the hood and why some workloads benefit from Dask distributed while others do not.

Background

TPC-H is a standard data schema representing a business's operations: parts, suppliers, orders, customers, etc. It has been used for decades to test relational database systems. For testing earthmover, I used tpch-kit to generate various (doubling) sizes (from about 10MB to over 20GB) of the TPC-H dataset. I then created earthmover transformation YAML for the first 5 of the canonical TPC-H queries.

TPC-H query 1 (group_by)

This query computes various statistics about ordered items by status. It consists of

  • select from a single table
  • filter rows by date
  • group by statuses and compute sum and average of several columns
  • sort result by statuses

(The query ultimately produces very few rows, less than a dozen.)

earthmover performance results (numbers are runtime in seconds):

| Input size | main em | distributed em (2 core x 2.4 GB) | distributed em (4 core x 1.2 GB) | distributed em (6 core x 800 MB) |
| --- | --- | --- | --- | --- |
| 10M | 18 | 21 | 33 | 32 |
| 20M | 34 | 31 | 36 | 47 |
| 50M | 57 | 66 | 69 | 92 |
| 100M | 130 | 126 | 131 | 151 |
| 200M | 582 | 290 | 216 | 296 |
| 500M | 3254 | 765 | 916 | - |


TPC-H query 2 (join)

This query computes account balances for suppliers. It consists of

  • select from 5 tables (joined together)
  • filter rows by supplier region and part type and size
  • a subquery
  • sort result by balance and other fields

(The earthmover.yml implementation pushes the filters up front, to minimize the number of rows that must be joined.)

earthmover performance results (numbers are runtime in seconds):

| Input size | main em | distributed em (2 core x 2.4 GB) | distributed em (4 core x 1.2 GB) | distributed em (6 core x 800 MB) |
| --- | --- | --- | --- | --- |
| 10M | 6 | 14 | 18 | 24 |
| 20M | 7 | 15 | 19 | 22 |
| 50M | 7 | 19 | 25 | 32 |
| 100M | 9 | 25 | 28 | 42 |
| 200M | 12 | 39 | 49 | 65 |
| 500M | 25 | 84 | 77 | 138 |
| 1G | 167 | 174 | 137 | 267 |
| 2G | 391 | 324 | 328 | 379 |
| 5G | 1075 | 1295 | 1103 | - |


TPC-H query 3 (join and group_by)

This query computes revenue per order. It consists of

  • select from 3 tables (joined together)
  • filter rows by market segment and for a specific date
  • group by several order-related fields
  • sort result by revenue and order date

earthmover performance results (numbers are runtime in seconds):

| Input size | main em | distributed em (2 core x 2.4 GB) | distributed em (4 core x 1.2 GB) | distributed em (6 core x 800 MB) |
| --- | --- | --- | --- | --- |
| 10M | 6 | 15 | 14 | 19 |
| 20M | 7 | 19 | 16 | 20 |
| 50M | 10 | 25 | 23 | 21 |
| 100M | 16 | 39 | 30 | 33 |
| 200M | 37 | 76 | 57 | 57 |
| 500M | 225 | 258 | 156 | 154 |
| 1G | - | 568 | 496 | 507 |
| 2G | - | 2446 | 1177 | 1266 |


TPC-H query 4 (join and group_by)

This query computes the number of orders by priority over a certain date range. It consists of

  • select from one table
  • filter rows by a specific date range
  • a subquery (for existence) over another table
  • group by order priority
  • sort result by order priority

earthmover performance results (numbers are runtime in seconds):

| Input size | main em | distributed em (2 core x 2.4 GB) | distributed em (4 core x 1.2 GB) | distributed em (6 core x 800 MB) |
| --- | --- | --- | --- | --- |
| 10M | 5 | 10 | 10 | 12 |
| 20M | 7 | 10 | 10 | 14 |
| 50M | 11 | 14 | 13 | 18 |
| 100M | 20 | 20 | 16 | 20 |
| 200M | 59 | 38 | 31 | 34 |
| 500M | 149 | 102 | 63 | 77 |
| 1G | 291 | 219 | 131 | 147 |
| 2G | 485 | 583 | 329 | 417 |
| 5G | 2076 | 1253 | 1013 | 915 |


TPC-H query 5 (join and group_by)

This query computes revenue by country over a certain date range. It consists of

  • select from 6 tables (joined together)
  • filter rows by a specific date range
  • group by country name, sum a revenue expression
  • sort result by revenue

earthmover performance results (numbers are runtime in seconds):

| Input size | main em | distributed em (2 core x 2.4 GB) | distributed em (4 core x 1.2 GB) | distributed em (6 core x 800 MB) |
| --- | --- | --- | --- | --- |
| 10M | 10 | 19 | 21 | 22 |
| 20M | 15 | 30 | 29 | 29 |
| 50M | 30 | 42 | 40 | 36 |
| 100M | 59 | 99 | 60 | 52 |
| 200M | 327 | 229 | 151 | 133 |
| 500M | 1502 | 653 | 456 | 325 |
| 1G | 4397 | 1086 | 734 | 942 |



More about dask distributed performance in the next comment, but general takeaways include

  • dask distributed certainly adds some overhead
  • group_by performance is the worst of any operation, and the more create_columns there are, the worse it is
  • benefits of earthmover distributed are only realized on large data (hundreds of MB to several GB or more); it improves wall-clock processing time for some workloads, and also enables processing larger datasets (on which single-threaded earthmover can fail)
  • the "sweet spot" depends on the workload, but often using 4 workers/cores seems like a good balance of parallelism vs. coordination
  • scaling too wide (too many workers/cores) can make memory the bottleneck
  • scaling tends to be approximately linear

My next testing step will be to test distributed earthmover's performance with the Ed-Fi bundles and some large assessment data files.

tomreitz (Collaborator) commented Mar 12, 2025

In this comment, I'll explain some things I've learned about how Dask and Dask distributed work, how earthmover's various operations use them, and how this affects performance. Once this work is more mature and thoroughly tested, perhaps portions of this comment can be included in the earthmover docs on a new "distributed" page.

A core innovation of Dask is representing a dataframe as a collection of partitions, each of which is a Pandas dataframe. This, together with writing intermediate Pandas partitions to disk as needed and clever re-implementations of some Pandas dataframe methods, enables Dask to handle dataframes larger than memory. And this is why earthmover initially used Dask.
Once data is partitioned, it becomes possible to process it in parallel - this is the functionality that Dask distributed provides. It consists of

  1. a scheduler
  2. a cluster of Dask workers - which can be CPU cores on a single machine (how earthmover works) or even separate machines on a network

The scheduler (1) carves up data transformations on an entire dataframe into smaller transformations on each partition, and it sends these smaller pieces of work to the workers (2).
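On a single machine, standing up this scheduler/worker arrangement amounts to choosing how to split the cores and memory budget. The helper below is an illustrative sketch, not earthmover code, though n_workers, threads_per_worker, and memory_limit are real LocalCluster keyword arguments:

```python
def cluster_kwargs(total_memory_gb, n_workers):
    # Split a fixed memory budget evenly across workers, mirroring the
    # configurations tested above (e.g. 4 workers x 1.2 GB from ~4.8 GB).
    per_worker_gb = total_memory_gb / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,  # one process per core; avoids GIL contention
        "memory_limit": f"{per_worker_gb:.1f}GB",
    }

kwargs = cluster_kwargs(total_memory_gb=4.8, n_workers=4)
# With dask[distributed] installed, these kwargs go straight to the cluster:
#   from dask.distributed import LocalCluster, Client
#   client = Client(LocalCluster(**kwargs))
```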

Let us consider some of the operations earthmover supports, and how they work with Dask's partitioning.

union

Union is quite straightforward; the partitions of two (or more) dataframes are stacked to form a single larger dataframe.
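In partition terms, union needs no data movement at all. A pure-Python sketch, with lists of rows standing in for partitions (with Dask this is essentially dd.concat):

```python
# Partitions of two dataframes being unioned:
frame_a = [["row1", "row2"], ["row3"]]
frame_b = [["row4"], ["row5", "row6"]]

# The union is just the partition lists stacked end to end; no partition's
# contents change, so no shuffling or re-sorting is required.
unioned = frame_a + frame_b
npartitions = len(unioned)
```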

column-operations

Column operations such as add_columns, modify_columns, keep_columns, drop_columns, map_values, etc. can be applied to each partition separately, and so are highly parallelizable.
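The per-partition shape of this work can be sketched as follows (dicts of column lists stand in for pandas partitions, and the column names are invented; in Dask the scheduler farms one such call per partition out to the workers via .map_partitions()):

```python
def add_and_rename(partition):
    # Each partition is transformed independently of every other partition,
    # which is what makes column operations embarrassingly parallel.
    out = dict(partition)
    out["school_year"] = [2025] * len(out["student_id"])  # add a column
    out["id"] = out.pop("student_id")                     # rename a column
    return out

partitions = [
    {"student_id": [1, 2]},
    {"student_id": [3]},
]
results = [add_and_rename(p) for p in partitions]  # parallelizable per partition
```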

join

Join can be fast and parallelizable if both frames are sorted on the join key(s) - each worker can take one or a few partitions from each dataframe and join those (in memory). If one or both dataframes are not sorted, then each partition from one dataframe may need to be joined against every partition of the other dataframe (the dotted lines in the picture above) - obviously much slower.
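The aligned (sorted) case can be sketched like this, with dicts keyed by the join key standing in for partitions that cover matching key ranges:

```python
# Both sides sorted and partitioned on the same key ranges ([1-2], [3]):
left = [{1: "a", 2: "b"}, {3: "c"}]
right = [{1: "x", 2: "y"}, {3: "z"}]

# Because partition i of each side covers the same key range, each worker
# only joins partition i of left with partition i of right, in memory.
# In the unsorted case this zip would become a full cross of partitions.
joined = [
    {k: (lp[k], rp[k]) for k in lp if k in rp}
    for lp, rp in zip(left, right)
]
```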

group_by

Group-by is among the slowest operations in earthmover and Dask, particularly when the grouping drastically reduces the number of rows (i.e., when there are few groups): the operation must almost always be followed by a repartition (see below), since otherwise many partitions will be empty, and leaving them in place can cause downstream NaN values and other issues. (Dask unfortunately has no efficient way to drop empty partitions, nor any general way to make downstream computations conditional on properties of the data itself.)


Add to this the fact that each of the create_columns must be separately run against the Dask GroupBy structure and then repartitioned, and the performance implications of group_by become apparent.
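The aggregation itself can be sketched as a tree reduction over partitions (pure Python; tuples stand in for rows). The final combine step shows why the result fits in far fewer partitions than the input, motivating the repartition described above:

```python
# Input partitions: (status, value) rows spread across three partitions.
partitions = [
    [("F", 1), ("O", 2)],
    [("F", 3)],
    [("O", 4), ("F", 5)],
]

def partial_sums(rows):
    # Per-partition partial aggregation; these run in parallel on workers.
    acc = {}
    for status, value in rows:
        acc[status] = acc.get(status, 0) + value
    return acc

# Combine the partial results. With only two groups, the entire output fits
# in a single partition, leaving the other partitions effectively empty.
totals = {}
for part in map(partial_sums, partitions):
    for status, subtotal in part.items():
        totals[status] = totals.get(status, 0) + subtotal
```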
