
How to remove the cached memory by DALI #5816

Open
1 task done
zmasih opened this issue Feb 10, 2025 · 1 comment
Assignees
Labels
question Further information is requested

Comments


zmasih commented Feb 10, 2025

Describe the question.

I am working on optimizing a DALI pipeline for benchmarking data-loading performance and want to clarify DALI’s memory management and caching behavior.

Based on the documentation, I understand that:

  • DALI does not release memory to the system but reuses it via a global memory pool.
  • Deleting a pipeline does not free memory, but it allows a new pipeline to reuse the allocated memory.
  • Pipeline recreation adds significant overhead due to initialization costs.
  • DALI readers (e.g., fn.readers.tfrecord) may have internal caching mechanisms that persist between pipeline instances.

Is my understanding correct?

To be clearer, I have the following questions:

  1. Does DALI reuse any previously cached dataset buffers when a new pipeline is created, or does each new pipeline force a full reload from storage?
  2. If a new pipeline reuses allocated memory, does that mean the dataset itself might still be cached in RAM (or another internal buffer) rather than being freshly loaded?
  3. Are random_shuffle=True and cache_header_information=False sufficient to guarantee that each batch is read fresh from disk/memory, even when using the same pipeline instance?
  4. Would manually calling nvidia.dali.backend.ReleaseUnusedMemory() ensure fresh data loading, or does it only affect unused memory blocks?

My goal is to ensure that each iteration loads data fresh from storage rather than from cached batches. Any insights into how DALI handles this at a low level would be highly appreciated.

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@zmasih zmasih added the question Further information is requested label Feb 10, 2025

JanuszL commented Feb 10, 2025

Hi @zmasih,

DALI maintains a memory pool to speed up allocation of GPU and CPU pinned memory. However, the pool is not meant to carry information between pipeline instances; it only saves the time spent on allocation.
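To make the pool behavior concrete, here is a minimal sketch around nvidia.dali.backend.ReleaseUnusedMemory(), which returns unused blocks from DALI's pool to the driver/OS. The helper name release_pool_memory and the import guard are illustrative additions, not part of DALI's API; the snippet is guarded so it is safe to run even where DALI is not installed.

```python
def release_pool_memory() -> bool:
    """Ask DALI's memory pool to return unused blocks to the OS/driver.

    Returns True if the call was made, False if DALI is unavailable.
    Note: this only frees blocks not currently held by a live pipeline;
    it does not invalidate any dataset cached by the OS.
    """
    try:
        from nvidia.dali.backend import ReleaseUnusedMemory
    except ImportError:
        return False  # DALI not installed in this environment

    # ... build, run, and delete pipelines here; deleting a pipeline
    # merely returns its buffers to DALI's pool ...
    ReleaseUnusedMemory()
    return True
```

Deleting a pipeline and then calling this before recreating one is the closest DALI gets to "freeing" memory, but it has no effect on where the data itself is read from.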

Does DALI reuse any previously cached dataset buffers when a new pipeline is created, or does each new pipeline force a full reload from storage?

DALI doesn't carry such information over, but the OS disk cache does. So, as long as the data on the drive is unchanged, in many cases it can be served directly from RAM (when you call a read operation, the OS uses the data from its cache, not from storage). You can either ask the OS to drop its caches (by writing to /proc/sys/vm/drop_caches) or try using use_o_direct, which bypasses the OS caching mechanism.
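As a narrower, unprivileged alternative to the system-wide drop_caches write (which requires root), the Linux posix_fadvise(POSIX_FADV_DONTNEED) hint can evict a single file from the page cache between benchmark runs. This is a suggestion beyond what the thread mentions; a minimal sketch, assuming Linux (os.posix_fadvise is not available on all platforms):

```python
import os

def evict_file_from_page_cache(path: str) -> None:
    """Hint the kernel to drop this file's pages from the OS page cache.

    Per-file and unprivileged, unlike `echo 3 > /proc/sys/vm/drop_caches`.
    POSIX_FADV_DONTNEED is advisory: dirty pages cannot be dropped, so
    flush them first if the file was recently written.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush any dirty pages so they become droppable
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Calling this on each dataset shard before a benchmark iteration approximates a cold-cache read without disturbing the rest of the system's cache; use_o_direct avoids the cache entirely instead.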

@jantonguirao jantonguirao assigned JanuszL and unassigned jantonguirao Feb 10, 2025