
S3 Select causing high memory consumption on >3GiB file #20430

Closed
OllieRushton opened this issue Sep 13, 2024 · 3 comments · Fixed by #20439
OllieRushton commented Sep 13, 2024

When running a select statement on a gzip json dataset (~3 GiB), the memory usage spikes significantly more than I would expect.

[Graph: MinIO pod memory usage spiking during the S3 Select operation]

Expected Behavior

I would expect the memory consumption of an S3 Select operation to be proportional to the LIMIT on the SQL command, and that streaming data processing is used so the entire object is not loaded into memory.

Current Behavior

I'm using the mc client to run an S3 Select statement on my ~3 GiB gzipped JSON file (11.5 GiB uncompressed). The file is a JSON array (i.e. not NDJSON), so I query it with the following command:

mc sql myminio/mybucket/dataset.json --query "SELECT * FROM S3Object[*]._1[*] s LIMIT 10" --json-input type=DOCUMENT --compression GZIP

The command hangs for a while before I can see the minio pod is OOMKilled. This is running on a 32 GiB box where all other workloads have been removed, so MinIO is free to use all the resources.

I tried this on a 64 GiB box, which worked, but the memory stayed high for longer than I'd expect after the command returned a response (as seen in the memory usage graph above).

Possible Solution

¯\_(ツ)_/¯

Steps to Reproduce (for bugs)

  1. Set up MinIO on a box with 32 GiB memory.
  2. Create a 3 GiB gzipped JSON file storing a JSON array of records, i.e. [{ "a": 1, ... }, ...], and upload it to a bucket.
  3. Run a select command:
mc sql myminio/mybucket/dataset.json --query "SELECT * FROM S3Object[*]._1[*] s LIMIT 10" --json-input type=DOCUMENT --compression GZIP
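Step 2 above can be sketched with a small Python script using only the standard library. This is an illustration, not the reporter's actual data: the record shape is hypothetical and the record count is scaled far down from the ~3 GiB file in the report.

```python
import gzip
import json

# Generate a gzip-compressed file containing a single top-level JSON
# array (i.e. a DOCUMENT, not NDJSON). Scale `count` up to approach
# the ~3 GiB compressed size from the report.
count = 1000
records = [{"a": i, "payload": "x" * 100} for i in range(count)]

with gzip.open("dataset.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f)  # one array object, so S3 Select cannot stream it
```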

Context

I want to allow users to query parts of the dataset in a language they are familiar with (SQL).

Regression

Not sure, but I have seen this issue before which I thought could be related: #17235

Your Environment

  • Version used (minio --version):
    minio version DEVELOPMENT.2024-09-09T16-59-28Z (commit-id=0b7aa6af879e088030d63e91a29dabf22fdd3a18)
    Runtime: go1.22.7 linux/amd64
    License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
    Copyright: 2015-2024 MinIO, Inc.

  • Server setup and configuration: Deployed to an AKS cluster using the bitnami minio helm chart version 14.7.8 with default config, but using nodeSelector and tolerations to force it onto a node with no workloads running.

  • Operating System and version (uname -a): Node image version: AKSUbuntu-2204gen2containerd-202409.04.0

klauspost commented Sep 16, 2024

A JSON array is effectively one object; no streaming is possible on it.

Use NDJSON with separate entries. If any change comes from this issue, it will be limiting single-object sizes, so the array approach will not end up working either way.
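The suggested conversion from an array document to NDJSON can be sketched as follows. This is a minimal Python illustration, not part of the original thread; the filenames are placeholders, and note that json.load() itself still reads the whole array into memory, so for files larger than RAM a streaming parser (e.g. the third-party ijson package) would be needed.

```python
import gzip
import json

# Create a tiny example input: a gzip-compressed JSON array, standing
# in for the ~3 GiB file from the report.
with gzip.open("dataset.json.gz", "wt", encoding="utf-8") as f:
    json.dump([{"a": 1}, {"a": 2}, {"a": 3}], f)

# Convert the array document into gzip-compressed NDJSON: one JSON
# record per line, which S3 Select can process record by record.
with gzip.open("dataset.json.gz", "rt", encoding="utf-8") as src, \
     gzip.open("dataset.ndjson.gz", "wt", encoding="utf-8") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")
```

The converted file should then be queryable with mc sql using --json-input type=LINES instead of type=DOCUMENT.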

klauspost added a commit to klauspost/minio that referenced this issue Sep 16, 2024
Closes minio#20430

Limit allocations from badly formed documents.
OllieRushton commented Sep 17, 2024

Hi @klauspost, thanks for replying so quickly. I will try with NDJSON, as each record in the dataset is way below the 1 MB threshold :)
