
Memory leak on writes/merges #2522

Open
echai58 opened this issue May 17, 2024 · 12 comments

Labels: binding/rust (Issues for the Rust crate), question (Further information is requested)


echai58 commented May 17, 2024

Environment

Binding: python


Bug

What happened:
We're noticing steadily rising memory usage in our processes that write to deltalake. I wrote a minimal reproduction that writes to deltalake in a loop, and the memory usage seems to indicate a leak.

What you expected to happen:
Memory to be reclaimed after writes.

How to reproduce it:
This is the script I tested with:

import datetime
from deltalake import write_deltalake, DeltaTable
import numpy as np
import pandas as pd 
import pyarrow as pa

schema = pa.schema(
    [
        ("date", pa.date32()),
        ("c1", pa.string()),
        ("c2", pa.int32()),
        ("v1", pa.float64()),
        ("v2", pa.float64()),
        ("v3", pa.int32()),
    ]
)

# Merge one batch of random rows keyed on (date, c1, c2), checkpointing every 10th version.
def write_table(dt: DeltaTable, date: datetime.date):
    num_rows = 150
    data = pd.DataFrame(
        {
            "date": np.repeat(date, num_rows),
            "c1": [f"str_{i}" for i in np.random.randint(0, 100, num_rows)],
            "c2": np.random.randint(1, 100, num_rows),
            "v1": np.random.rand(num_rows) * 100,
            "v2": np.random.rand(num_rows) * 100,
            "v3": np.random.randint(0, 100, num_rows),
        }
    )
    dt.merge(
        pa.Table.from_pandas(data, schema=schema),
        "s.date = t.date and s.c1 = t.c1 and s.c2 = t.c2",
        source_alias="s",
        target_alias="t",
    ).when_matched_update_all().when_not_matched_insert_all().execute()
    if dt.version() % 10 == 0:
        dt.create_checkpoint()

# Create the table once, then run 25 dates x 50 writes = 1250 merge commits.
def main():
    DeltaTable.create("test", schema)

    for date in pd.date_range("2022-01-01", periods=25):
        for _ in range(50):
            write_table(DeltaTable("test"), date.date())

if __name__ == "__main__":
    main()

More details:
Here's the memray graph:
[image: memray graph]
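
For reference, a profile like this can be captured with memray; below is a minimal sketch using its Python tracker API (repro.bin is an arbitrary output name), rendering afterwards with memray's flamegraph reporter:

import memray

# Record every allocation made while main() runs into repro.bin;
# render it afterwards with: memray flamegraph repro.bin
with memray.Tracker("repro.bin"):
    main()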

I also tested this with just write_deltalake(mode="append"), and the issue persists there as well:
[image: memray graph]
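
A minimal sketch of that append-only variant, reusing the schema and data generation from the script above (write_append is a hypothetical helper name):

def write_append(date: datetime.date):
    num_rows = 150
    data = pd.DataFrame(
        {
            "date": np.repeat(date, num_rows),
            "c1": [f"str_{i}" for i in np.random.randint(0, 100, num_rows)],
            "c2": np.random.randint(1, 100, num_rows),
            "v1": np.random.rand(num_rows) * 100,
            "v2": np.random.rand(num_rows) * 100,
            "v3": np.random.randint(0, 100, num_rows),
        }
    )
    # A plain append commit instead of a merge; memory still climbs over time.
    write_deltalake("test", pa.Table.from_pandas(data, schema=schema), mode="append")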

I saw #2068 and tried setting that env var, and got the following (it doesn't seem to help):
[image: memray graph]

echai58 added the bug (Something isn't working) label on May 17, 2024

ion-elgreco commented May 17, 2024

Can you do the following: change your script to sleep for 30 seconds first, then create the table, sleep again for 30 seconds, and only then start writing in the loop those 50 times?

I think the slow increase in resident size might just be because every write updates the table state at the end of the commit, and that state includes more info after each commit.
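
A sketch of the suggested modification, reusing write_table from the original script; the sleeps separate baseline memory, post-create memory, and write activity on the graph:

import time

def main():
    time.sleep(30)  # baseline resident memory, before any Delta work
    DeltaTable.create("test", schema)
    time.sleep(30)  # memory after table creation, before any writes
    for date in pd.date_range("2022-01-01", periods=25):
        for _ in range(50):
            write_table(DeltaTable("test"), date.date())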

rtyler added the binding/python (Issues for the Python package) label on May 17, 2024

echai58 commented May 20, 2024

Here's the memory graph with the 30-second sleeps:

[image: memray graph]


echai58 commented May 20, 2024

@ion-elgreco

This is a graph for 250 merges:
[image: memray graph]

and for 250 appends:
[image: memray graph]

The slow increase in the 250 appends would, I think, correspond to the growing table-state metadata. But the merge memory usage rises much faster, so maybe it is something particular to merges?

ion-elgreco commented:

Merge operations probably hold more info, so this looks normal to me.

ion-elgreco added the question (Further information is requested) label and removed the bug (Something isn't working) and binding/python (Issues for the Python package) labels on May 20, 2024

echai58 commented May 21, 2024

@ion-elgreco The _last_checkpoint file says:
{"size":360,"size_in_bytes":75770,"version":100}

for the last checkpoint after running a script with 1000 merges, which resulted in the following memory increase:
[image: memray graph]

The amount the memory increased seems much larger than the metadata itself.
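
For context, _last_checkpoint is a small JSON pointer file in the table's _delta_log directory; the numbers above can be read back with something like:

import json

# The checkpoint pointer sits alongside the commit JSON files.
with open("test/_delta_log/_last_checkpoint") as f:
    info = json.load(f)

print(info)  # e.g. {"size": 360, "size_in_bytes": 75770, "version": 100}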

ion-elgreco commented:

@echai58 the checkpoint is compressed, and it would never translate 1:1 from disk to memory AFAIK.


echai58 commented May 21, 2024

@ion-elgreco profiling a script that just instantiates the same delta table gives the following:
[image: memray graph]

~13 MB, which is still much less than the >100 MB seen from the merge script.
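
A sketch of that instantiation-only measurement, assuming the table produced by the merge script above:

from deltalake import DeltaTable

# Loading the table state (log plus checkpoint) in isolation accounts for
# roughly 13 MB here, far below the >100 MB the merge loop accumulates.
dt = DeltaTable("test")
print(dt.version())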


malachibazar commented Jul 16, 2024

Has there been any progress on this? I'm experiencing the same issue with merges.

rtyler added the binding/rust (Issues for the Rust crate) label on Aug 10, 2024

rtyler commented Aug 10, 2024

I have a theory that this might be related to some of the performance issues that @roeap and I were hunting this week. He's got some fixes in mind, after which we can look into this specific issue some more.


aldder commented Dec 31, 2024

Hello, any update on this topic?


kapteinkaald commented Feb 10, 2025

I think we have the same issue here.

ion-elgreco commented:

There are a couple of features in main that have improved memory usage.

@rtyler is doing some testing at the moment on the memory performance. The numbers are promising IMHO, even if it's a micro-benchmark for normal appends. Before these PRs landed in main, it was around 500-700 MB of memory usage; now it's at a stable 500 MB, and with the WIP PR #3196 using the LazyTableProvider it's at 30-50 MB ^^
