S3 Select causing high memory consumption on >3GiB file #20430
Comments
An array is effectively one object, so no streaming is possible on it. Use NDJSON with separate entries. If any change comes from this, it will be to limit single object sizes, so this will not end up working either way.
FWIW, AWS limits this to 1 MB: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
Closes minio#20430: Limit allocations from badly formed documents.
Hi @klauspost, thanks for replying so quickly. I will try with NDJSON, as each record in the dataset is well below the 1 MB threshold :)
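The NDJSON suggestion above amounts to rewriting the array file as one JSON record per line. A minimal sketch (file paths hypothetical; note that `json.load` here still buffers the whole array, so for inputs that do not fit in RAM a streaming parser would be needed):

```python
import json

def array_to_ndjson(src_path: str, dst_path: str) -> int:
    """Convert a JSON-array file to NDJSON (one record per line).

    Caveat: json.load reads the entire array into memory; this sketch
    only suits inputs that fit in RAM.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
    return len(records)
```

Each output line is then an independent record that the server can parse and discard one at a time, instead of one multi-gigabyte JSON value.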
When running a select statement on a gzipped JSON dataset (~3 GiB), memory usage spikes significantly more than I would expect.
Expected Behavior
I would expect the memory consumption of an S3 Select operation to scale with the LIMIT clause of the SQL command, and the data to be processed as a stream, so that the entire object is never loaded into memory.
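The expected behavior can be illustrated with a small sketch (names and file layout hypothetical): streaming a gzipped NDJSON object line by line and stopping after N records keeps peak memory bounded by the largest single record plus the result set, not the object size.

```python
import gzip
import json
from itertools import islice

def select_first_n(path: str, n: int) -> list:
    """Stream a gzipped NDJSON file and return only the first n records.

    Records are decompressed and decoded one line at a time, so memory
    use is independent of the total (uncompressed) file size.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]
```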
Current Behavior
I'm using the mc client to run an S3 Select statement on my ~3 GiB gzipped JSON file (11.5 GiB uncompressed). The file is a single JSON array (i.e. not NDJSON), so I query it as one document.
The query hangs for a while before I can see the MinIO pod is OOMKilled. This is running on a 32 GiB node where all other workloads have been removed, so MinIO is free to use all the resources.
I tried this on a 64 GiB box, which worked, but the memory stayed high after the command returned a response, for longer than I'd expect (as seen in the memory-usage diagram).
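The exact mc command from the original report is not preserved in this copy. As an illustration of the same kind of query through the S3 API (bucket, key, and SQL are hypothetical), the request below tells the server the object is a single gzipped JSON document rather than NDJSON:

```python
def build_select_request(bucket: str, key: str, sql: str) -> dict:
    """Build keyword arguments for an S3 Select call over a gzipped
    JSON document (matches boto3's select_object_content signature).

    JSON Type "DOCUMENT" means the object is one JSON value (here, an
    array); "LINES" would mean NDJSON, one record per line.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": sql,
        "ExpressionType": "SQL",
        "InputSerialization": {
            "JSON": {"Type": "DOCUMENT"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }

def run_select(client, bucket: str, key: str, sql: str):
    """Execute the query and yield result chunks as they stream back.

    `client` is assumed to be a boto3 S3 client pointed at the MinIO
    endpoint (setup not shown).
    """
    resp = client.select_object_content(**build_select_request(bucket, key, sql))
    for event in resp["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"]
```

With `Type: DOCUMENT`, the server has to treat the whole array as one record, which is consistent with the maintainer's point above that no streaming is possible on a single-array object.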
Possible Solution
¯\_(ツ)_/¯
Steps to Reproduce (for bugs)
Create a large JSON file in the format [{ "a": 1, ... }, ...], gzip it, and upload it to a bucket.
Context
I want to allow users to query parts of the dataset in a language they are familiar with (SQL).
Regression
Not sure, but I have seen this issue before, which I thought could be related: #17235
Your Environment
Version used (minio --version):
minio version DEVELOPMENT.2024-09-09T16-59-28Z (commit-id=0b7aa6af879e088030d63e91a29dabf22fdd3a18)
Runtime: go1.22.7 linux/amd64
License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
Copyright: 2015-2024 MinIO, Inc.
Server setup and configuration: Deployed to an AKS cluster using the bitnami minio helm chart version 14.7.8 with default config, but using nodeSelector and tolerations to force it onto a node with no workloads running.
Operating System and version (uname -a): Node image version: AKSUbuntu-2204gen2containerd-202409.04.0