Investigate IO performance issue with workers and object-store-backed datasets #637
Caching may help, but memory configuration and VM vs. container vs. bare metal may make a difference. Totally different product, but Ceph had issues if it had too much RAM; it had something to do with data replication. I don't remember - it was a nightmare. Reading and writing from the object store may also cause problems due to rapid I/O on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling. If it comes down to it, a pre- and post-processing step to stage the required data locally might be a solution. That'll be its own nightmare.
This is basically my initial guess for the problem: too much IO, applied to only a single MinIO node now (due to other restrictions; see #242 et al.), running on older, less robust hardware. It's not out of the question that the object store proxy or Docker storage plugin configs could be sub-optimal, or that something else is going on, but I'm not getting my hopes up.
I suspect the best way to do this would actually be to just support another dataset type and add code to the data service to transform to/from such a dataset, similarly to how we do other dataset derivations now. That should also make it easy to offer these transformations as operations in their own right. Doing things purely within the job workers would probably be slower, would be less reusable, and might compel us to add other complicated functionality, like managing and allocating disk space the way we do CPUs.
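As a rough illustration of that staging idea (this is not existing data service code; the alias, dataset, and path names below are hypothetical), the pre- and post-processing steps for a job could amount to little more than mirroring the relevant buckets to and from node-local storage with the MinIO client:

```shell
# Hypothetical pre-processing step: pull the job's input datasets from the
# object store onto fast node-local storage before the workers start.
mc alias set dmod http://minio-proxy:9000 "$ACCESS_KEY" "$SECRET_KEY"
mc mirror dmod/forcing-dataset /local/jobs/job-123/forcing
mc mirror dmod/config-dataset /local/jobs/job-123/config

# ... workers run ngen against the local copies ...

# Hypothetical post-processing step: push results back into the output dataset.
mc mirror /local/jobs/job-123/output dmod/output-dataset
```

Handling this in the data service, as a dataset derivation, would also keep any disk-space bookkeeping in one place rather than spreading it across the job workers.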
After running another experiment yesterday, I am a little more concerned that the choke point is actually the s3fs Docker plugin itself, which is needed for the volumes. I am also considering an alternative to just the plugin part of the object store dataset setup.

The experiment was done using my local, 2-node deployment, with the MinIO and proxy services running on one node and the ngen job constrained to the other. I also left forcing out of it, using a local bind mount for that, but the other datasets - in particular the output dataset being written to - were s3fs plugin volumes. When things actually got to executing timesteps, I observed that the plugin appeared to be the bottleneck, maxing out a single CPU.

I have a few ideas. The first few involve finding ways to get the volume plugin more than 1 CPU. I haven't been able to quickly find a simple way to allocate more resources to a plugin. I've thought about using several plugins via aliasing, but I don't think there's a good way to break things up for a single dataset that way, and the writes to the output dataset seem to be what's causing the load. I've also started looking at further custom modifications to the Docker plugin, though only just started.

It occurred to me, though, that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious by how much. Of course, there was a reason we moved away from this: it might not be feasible for some environments. The current plugin-based method is more universally compatible, so we would need to continue to support it and add a way to configure which mounting mechanism a deployment uses. And we might want to combine this idea with some of the other pieces as well: e.g., dividing datasets across multiple plugins via aliasing, if the plugin method is being used.
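For what it's worth, here is a minimal sketch of the aliasing idea, assuming a hypothetical s3fs volume plugin image (the placeholder `<s3fs-plugin-image>` and the volume options are stand-ins for whatever the deployment actually uses). Installing the same plugin under multiple aliases yields separate plugin processes, so volumes for different datasets could at least be spread across them:

```shell
# Install the same volume plugin twice under different aliases; each alias runs
# as its own plugin process, and so gets its own CPU time.
docker plugin install --grant-all-permissions --alias s3fs-a <s3fs-plugin-image>
docker plugin install --grant-all-permissions --alias s3fs-b <s3fs-plugin-image>

# Create volumes for different datasets against different plugin instances
# (the option names here are placeholders, not the plugin's real options).
docker volume create -d s3fs-a --opt bucket=forcing-dataset forcing_vol
docker volume create -d s3fs-b --opt bucket=output-dataset output_vol
```

As noted above, though, this doesn't help when a single dataset (e.g., the output dataset) is responsible for most of the load.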
I was also curious about this. Here is what I've done so far.

To start, the goal of my exploration/work was to benchmark the IO performance of an s3fs FUSE mount backed by MinIO. The aim for this initial investigation is to keep things simple but reproducible, so that it is easy to run these benchmarks elsewhere. With that in mind, here are the hardware specs for the machine I used for testing:

All testing was done via Docker Desktop.
For these tests, the Docker Desktop VM was allocated 64 GB of virtual disk, 4 vCPUs, 12 GB of memory, and 4 GB of swap. The "methodology" for the benchmark is pretty straightforward:

1. Stand up a MinIO server and create a bucket.
2. Mount the bucket inside a container using s3fs.
3. Run a filesystem benchmark against the mount (and against local storage, for a baseline).
Glossing over the first two steps for now: after doing some searching, I found several mentions of filesystem benchmarking tools. TL;DR: use `iozone`.

Open two terminal sessions and follow these steps to run the benchmark. In the first terminal, create a network and a data directory, then start a MinIO server:

```shell
docker network create iozone_net
mkdir data
docker run \
  --rm \
  --name=minio \
  --net iozone_net \
  --expose=9000 --expose=9001 \
  -p 9000:9000 -p 9001:9001 \
  -v $(pwd)/data:/export \
  minio/minio:RELEASE.2022-10-08T20-11-00Z \
  server /export --console-address ":9001"
```

In a second terminal, create the bucket:

```shell
mc mb local/iozone
```
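Note that `mc mb local/iozone` assumes the MinIO client (`mc`) is installed on the host and that an alias named `local` pointing at this test server has already been registered. If it hasn't, something along these lines (using the default `minioadmin` credentials of the MinIO container above) should set it up:

```shell
# Register the local test server under the alias "local" used by `mc mb` above.
mc alias set local http://localhost:9000 minioadmin minioadmin
```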
Then, still in the second terminal, start an s3fs-capable container, install `iozone` inside it, mount the bucket, and run the benchmark:

```shell
docker run -it --rm \
  --net iozone_net \
  --device /dev/fuse \
  --cap-add SYS_ADMIN \
  --security-opt "apparmor=unconfined" \
  --env 'AWS_ACCESS_KEY_ID=minioadmin' \
  --env 'AWS_SECRET_ACCESS_KEY=minioadmin' \
  --env UID=$(id -u) \
  --env GID=$(id -g) \
  --entrypoint=/bin/sh \
  efrecon/s3fs:1.94

# everything below is run from within the running container
arch=$(uname -m)
wget "http://dl-cdn.alpinelinux.org/alpine/v3.19/community/${arch}/iozone-3.506-r0.apk"
apk add --allow-untrusted iozone-3.506-r0.apk
mkdir iozone_bench
s3fs iozone /opt/s3fs/iozone_bench \
  -o url=http://minio:9000/ \
  -o use_path_request_style
cd iozone_bench
iozone -Rac
```
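One detail to note: the conversion script below expects the Excel-style report sections that `iozone -R` prints at the end of the run to have been saved to a local file (assumed here to be named `excel_out.tsv`). For example, you could capture the run output inside the container and trim it down to just the report tables afterwards:

```shell
# Capture the full benchmark output (hypothetical filename); copy it out of the
# container afterwards (e.g. with `docker cp`) and keep just the report tables.
iozone -Rac | tee iozone_full_output.txt
```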
To turn those report tables into a spreadsheet, you will need `pandas` (and an Excel writer backend such as `openpyxl`) installed:

```python
import io
import pathlib

import pandas as pd

FILENAME = "./excel_out.tsv"

data_file = pathlib.Path(FILENAME)
data = data_file.read_text()

with pd.ExcelWriter(f"{data_file.stem}.xlsx") as writer:
    # Each report (e.g. "Writer report") is separated by a blank line.
    for sheet in data.split("\n\n"):
        # Skip empty chunks (e.g. from trailing blank lines).
        if not sheet.strip():
            continue
        name_idx = sheet.find("\n")
        name = sheet[:name_idx].strip('"')
        sheet_data = sheet[name_idx + 1:]

        df = pd.read_table(io.StringIO(sheet_data))
        df.rename(columns={"Unnamed: 0": "File Size Kb"}, inplace=True)
        df.set_index("File Size Kb", inplace=True)
        df.columns.name = "Record Size Kb"
        df = df / 1024  # translate records from Kb/sec to Mb/sec
        df.to_excel(writer, sheet_name=name)
```

The output spreadsheet will have a sheet for each of the io benchmarks (e.g. "Writer report").
Writer report:
Reader report:
For comparison's sake, here are "baseline" results from running `iozone` against local storage inside the container (i.e., not the fuse mount).

Baseline Writer report:
Baseline Reader report:
I was not overly shocked to see that there is over an order of magnitude difference between the fuse mount and local storage across the board, but I also didn't think local performance would be this poor. The fastest fuse mount combination for
Next, I took a look at the s3fs Docker volume plugin. There is a substantial drop off in both writing and reading performance compared to the more direct s3fs fuse mount.

As a sanity check I ran a benchmark using
A few more observations during experiments, with respect to job output CSV files totaling 2.3 G in size:
So we appear to be paying a significant penalty for the plugin, but it is nowhere near the penalty paid for dealing with multiple files. In particular, this suggests the addition of NetCDF output from ngen (see NOAA-OWP/ngen#714 and NOAA-OWP/ngen#744) will be particularly useful.
I got curious over lunch about how

Notice the 1K file size / 1K record size result of ~16 Gb/sec. This is nearly double the performance of
1K is roughly the MTU of a TCP packet so that might make sense? The performance really drops off in adjacent file size / record size combinations though.
I just found out that models previously had to have all their inputs duplicated and copied to different nodes for MPI to work, prior to the availability of parallel file systems. As unpalatable as it might be to rely on large-scale data duplication, that might be the safest/most efficient way of running models via DMOD.
In recent testing, analogous jobs took several orders of magnitude longer when workers used object-store-backed datasets compared to if they used data supplied via local (manually synced) bind mounts. Rough numbers for one example are 5 hours compared to 5 minutes.
These tests involved a single node with 24 processors and 96 GB of memory, running both the services and workers. I.e., it's entirely possible the performance deficiency is heavily related to the compute infrastructure being insufficient for running jobs and an object store like this (initial development of these features involved 2-5 data-center-class servers). On the other hand, this is not an unrepresentative use-case for what we want DMOD to do.
Some investigation is needed as to whether there are any viable means to at least reasonably mitigate these issues (i.e., longer runtimes on lesser hardware are to be expected, but going from 5 minutes to 4 hours just isn't acceptable). Unless these performance issues are drastically reduced, we should likely accelerate plans for developing alternative dataset backends, such as cluster volumes within the Docker Swarm (see #593).