Possible memory leak #37401

Open · rbauststfc opened this issue May 20, 2024 · 15 comments

Labels: Bug, ISIS Team: Core, Linux Only

rbauststfc (Contributor) commented May 20, 2024

Please read all comments to get the full picture of the issue, as the problem appears to be more general than first described here.

Describe the bug

We've had reports of Mantid crashing due to memory use when users are auto-processing INTER experiment data while time slicing. I've done a bit of investigating and it looks to me like there may be a memory leak somewhere. This is most noticeable when we time slice in the GUI.

To Reproduce

The easiest way to replicate is as follows:

  1. Go to Interfaces -> Reflectometry -> ISIS Reflectometry.
  2. On the Runs tab, in the first row of the table on the right hand side enter run 77064 and angle 0.8.
  3. On the Event Handling tab, under the "Uniform slicing" section select the (sec) slices radio box and enter 5 to slice the data into 5 second slices.
  4. On the Experiment Settings tab enter the following into the first row of the Settings Lookup table:
    • ROI: 3-258
    • ROI Detector IDs: 3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256
  5. Back on the Runs tab, click the process button. This should complete successfully and create a set of workspace groups. Take a note of the memory usage at this stage.
  6. Clear the workspace list using the button in Mantid Workbench. The memory use won't go down, but I don't think this in itself necessarily indicates a memory leak, as it can just be that the system hasn't reclaimed the memory yet. To demonstrate the leak, click to process the row again in the Reflectometry GUI. The memory usage should continue to build up. If you keep doing this in a loop (usually 5 or 6 times, clearing the workspace list in between), the memory completely fills up and Mantid eventually crashes.

If I do the above without the GUI (i.e. calling ReflectometryISISLoadAndProcess from a script while time-slicing and clearing workspaces between each run) then the memory doesn't seem to build up very much at all. If I run the algorithm directly from the algorithm dialog then the memory does build up more quickly and eventually causes a crash.
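
For reference, a minimal script-based sketch of that non-GUI reproduction: loop the reduction with time slicing and clear the ADS between runs. The slicing property names used here (SliceWorkspace, TimeInterval) are my assumption for how the algorithm exposes the GUI's uniform slicing option, not something confirmed above:

from mantid.api import AnalysisDataService
from mantid.simpleapi import ReflectometryISISLoadAndProcess

for i in range(10):
    # Reduce run 77064 with 5 second uniform time slices (slicing property names assumed).
    ReflectometryISISLoadAndProcess(
        InputRunList="77064",
        ThetaIn=0.8,
        ProcessingInstructions="3-258",
        ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",
        SliceWorkspace=True,  # assumed flag to enable time slicing
        TimeInterval=5,       # 5 second slices, as in the Event Handling tab
    )
    # Clear everything between repeats, mirroring the GUI test.
    AnalysisDataService.clear()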

Expected behavior

Repeated processing of the data should not fill up the memory and crash Mantid if the workspace list is being cleared in between.

Platform/Version (please complete the following information):

  • OS: IDAaaS. Can't replicate on Windows; still need to test on Linux and potentially Mac.
  • Mantid Version: seems to be a problem in the Nightly; can't replicate it in version 6.9.1 or earlier.

Additional Information
Another way I've tested is to run the following script repeatedly (it does not perform the time slicing), clearing the workspaces between each run:

from mantid.simpleapi import *
import matplotlib.pyplot as plt
import numpy as np

num = 77064
for i in range(50):
    num = num + 1
    ReflectometryISISLoadAndProcess(InputRunList=str(num),ThetaIn=0.8,ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",ProcessingInstructions="3-258")

I haven't been able to crash Mantid using this script, though, even though the memory fills up to over 90% if run repeatedly. The memory builds up pretty slowly when tested in this way.
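
To make the slow build-up easier to quantify, here is a small sketch that logs the resident memory of the Workbench process between repeats. It assumes psutil is importable in the Mantid conda environment (treat that as an assumption):

import psutil

def log_rss(label):
    # Resident set size of the current (Workbench) process, in GiB.
    rss_gib = psutil.Process().memory_info().rss / 1024**3
    print(f"{label}: {rss_gib:.2f} GiB")

log_rss("before run")
# ... run the reproduction script, then clear the workspaces ...
log_rss("after clear")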

@rbauststfc added the Bug, Reflectometry, ISIS Team: LSS and Reported By User labels on May 20, 2024
@rbauststfc added this to the Release 6.10 milestone on May 20, 2024
@rbauststfc self-assigned this on May 20, 2024
@rbauststfc changed the title from "Memory leak when time slicing in ISIS Reflectometry GUI" to "Memory leak when reducing ISIS Reflectometry data" on May 20, 2024
rbauststfc (Contributor, Author) commented May 20, 2024

@cailafinn would you be happy to double-check this on IDAaaS and see whether you get the same behaviour on your Linux machine (and maybe Mac)? I would recommend using the first set of testing instructions (i.e. the time slicing example). While the script I've provided for the second way of causing the memory to fill up gets memory usage to over 90%, I can't seem to get it to the point where it crashes Mantid, so I think the first set of instructions is the most reliable way to demonstrate the problem.

@rbauststfc changed the title from "Memory leak when reducing ISIS Reflectometry data" to "Possible memory leak when reducing ISIS Reflectometry data" on May 21, 2024
cailafinn (Contributor) commented May 21, 2024

Reproduced on Ubuntu 22.04 (Nightly conda package). Took two repeats of clicking Process to crash on a system with 32GiB of RAM. When the ADS was cleared, memory usage dropped somewhat, but not significantly. See below.

Breakdowns:

Nightly:

  • Start:
    • Mantid: 0.5GiB
    • System: 5.2GiB
  • 1st Process:
    • Mantid: 15.8GiB
    • System: 20.4GiB
  • ADS Clear
    • Mantid: 13.3GiB
    • System: 18.1GiB
  • 2nd Process
    • Usage rose over 5-7 seconds until it reached the limits of the system, at which point workbench froze and was killed by the OOM Killer.

v6.9.1 and earlier:

  • Start:
    • Mantid: 0.4GiB
    • System: 1.1GiB
  • 1st Process:
    • Mantid: 15.8GiB
    • System: 16.4GiB
  • ADS Clear
    • Mantid: 13.3GiB
    • System: 14.0GiB
  • 2nd Process:
    • Mantid: 15.8GiB
    • System: 16.5GiB
  • 3rd Process:
    • Mantid: 15.7GiB
    • System: 16.5GiB

This would appear to be a regression.

rbauststfc (Contributor, Author) commented May 21, 2024

As Caila has suggested, we should see if this PR to address issues with jemalloc has any impact on this issue. We've not made any changes to the Reflectometry code since 6.9.1 that seem like they should be causing this, and it isn't happening on Windows. I'll test on IDAaaS when that PR has gone in and see where we are then.

rbauststfc (Contributor, Author) commented:

This doesn't appear to have been helped by the jemalloc change. We're still waiting on results from a Valgrind scan. In the meantime I've done some more testing on IDAaaS and have found this alternative way to replicate the behaviour:

  1. Load data using LoadEventNexus with the following settings:
    • Filename = INTER77064
    • LoadMonitors = True
    • MonitorsLoadOnly = Histogram
  2. From the algorithm dialog, run ReflectometrySliceEventWorkspace with the following settings:
    • InputWorkspace = name of loaded workspace
    • MonitorWorkspace = name of loaded monitor workspace
    • UseNewFilterAlgorithm = True (note the memory builds up regardless of how this is set)
    • TimeInterval = 5

It should show the same behaviour as the Reflectometry GUI. If ReflectometrySliceEventWorkspace is called repeatedly via a script, though, then the memory doesn't seem to build up in the same way; it seems to be necessary to call it from a GUI.
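
For completeness, a sketch of those same two steps as a script (which, as noted above, does not seem to reproduce the build-up). The "_monitors" suffix on the loaded monitor workspace name and the OutputWorkspace property are assumptions here:

from mantid.api import AnalysisDataService
from mantid.simpleapi import LoadEventNexus, ReflectometrySliceEventWorkspace

for i in range(10):
    LoadEventNexus(
        Filename="INTER77064",
        OutputWorkspace="events",
        LoadMonitors=True,
        MonitorsLoadOnly="Histogram",
    )
    ReflectometrySliceEventWorkspace(
        InputWorkspace="events",
        MonitorWorkspace="events_monitors",  # assumed name of the histogram monitor workspace
        UseNewFilterAlgorithm=True,
        TimeInterval=5,
        OutputWorkspace="sliced",
    )
    AnalysisDataService.clear()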

rbauststfc (Contributor, Author) commented:

Depending on the results of the Valgrind scan, any further investigation will need to be done by someone with access to a Linux machine, as I still can't replicate this on Windows.

rbauststfc (Contributor, Author) commented May 24, 2024

I've done a bit more testing on IDAaaS and this is starting to look like a more general problem. I can replicate it using the LoadISISNexus algorithm to load a WISH dataset.

I've found that if you run the following script then it doesn't build up the memory:

for i in range(3):
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
    AnalysisDataService.clear()

However if you do the above in any of the following ways then the memory does build up:

  • Run LoadISISNexus via the algorithm dialog and clear the ADS using the Clear button.
  • Run LoadISISNexus via the algorithm dialog and clear the ADS by running AnalysisDataService.clear() in the script editor.
  • Run LoadISISNexus via the script editor and clear the ADS using the Clear button.

This doesn't seem to happen in version 6.9.1 of Mantid, and again doesn't happen on Windows. When the memory fills up with this test it seems to eventually cause Mantid to freeze and become unresponsive on IDAaaS rather than crash.

@rbauststfc removed the Reported By User and Reflectometry labels on May 24, 2024
@rbauststfc changed the title from "Possible memory leak when reducing ISIS Reflectometry data" to "Possible memory leak" on May 24, 2024
warunawickramasingha (Contributor) commented May 24, 2024

I followed the reproduction steps on IDAaaS under the Valgrind memory analyser tool to check for any memory leaks. The main observation was a discrepancy between the amount of memory reported as leaked by Valgrind and the memory usage reported by Mantid Workbench, as shown below. As can be seen, Valgrind does not report a significant memory leak in this case.

  • Valgrind summary: [screenshot "leaks" attached]
  • System memory usage reported by Mantid at the end of test step 6: [screenshot "memory_usage" attached]

Valgrind log file for reference
37401_valgrind.zip

cailafinn (Contributor) commented May 28, 2024

One billion permutations later, here are my results. Testing on macOS to follow.

Testing on Ubuntu 22.04. 32GiB RAM.

Run Script Once: Loop Loading Call

for i in range(15):
    AnalysisDataService.clear()
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
  • Does not crash the program, even if the same load and clear function is performed multiple times within the script.

Run Script Multiple Times: Single Loading Call

# Repeatedly ran 3 times.
AnalysisDataService.clear()
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
  • With each re-run, the memory usage was seen to drop by around 2.3GiB when the ADS was cleared.
  • Memory usage then rose again. This could only be repeated 3 times before Mantid took up too much memory and was killed by the OoM Killer.

All memory values below are in GiB.

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.3 | 14.8 | 13.1 | 26.2 | 23.9 | CRASH | | | | |
| 6.9.1 | 0.3 | 14.8 | 13.4 | 26.2 | 23.8 | CRASH | | | | |
| 6.7 | 0.4 | 14.6 | 12.3 | 14.8 | 12.6 | 26.3 | 23.8 | 26.0 | 24.1 | CRASH |

  • Behaviour appears to be the same.

Run Script Multiple Times: Swap Clear Order

# Repeatedly ran 4 times.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
AnalysisDataService.clear()

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 23.6 | 25.8 | 23.8 | CRASH | N/A | N/A |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 25.7 | 23.7 | 25.9 | 23.9 | 25.8 | 24.1 | CRASH |

  • Behaviour is similar. This way around was a bit more variable for some reason, but I don't think this one is specifically a regression.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button. 

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.4 | 14.8 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 14.5 | 12.5 | 14.6 | 12.7 | 14.7 | 12.9 | 14.7 |
| 6.7 | 0.4 | 14.8 | 12.3 | 14.8 | 12.4 | 14.8 | 12.6 | 14.7 | 12.7 | 14.7 |

  • Okay, beginning to see a difference here.

Load with Algorithm Dialog, Clear From Script

# Click Load and load WISH57086 from the algorithm dialog
# Run below from a script
AnalysisDataService.clear()

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 24.9 | CRASH | | | | |
| 6.9.1 | 0.5 | 14.7 | 12.3 | 14.8 | 12.5 | 14.7 | 13.1 | 14.5 | 13.2 | 14.7 |

  • Same behaviour as with no scripting involved.

Load with a script, Clear from the Button

# Run the below script to load WISH57086
# Then clear using the button on the workbench GUI.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |

  • Okay, so the sum total of all of this seems to be that something has changed when loading from the algorithm dialog. Crashes occur on both versions when loading from scripts, regardless of what is done with the ADS clear. The only thing that stops the crashes from occurring in 6.9.1 is using the algorithm dialog to load the data.

cailafinn (Contributor) commented May 28, 2024

Testing on macOS 14.4.1 (Intel). 32GB RAM.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button. 

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.4 | 5.5 | 2.1 | 7.4 | 2.9 | 7.5 | 6.2 | 10.0 | 3.3 | 12.2 |
| 6.7 | 0.5 | 12.5 | 2.5 | 12.5 | 3.1 | 12.5 | 3.3 | 12.6 | etc. | |

  • Despite the workspace appearing in the "Workspaces" list as 20GB, it didn't appear to take all that up as RAM. I wonder if there's something different about how macOS handles the memory for large files?
  • Regardless, there doesn't seem to be an issue on macOS; it appears to be specific to Linux.

@rbauststfc added the Linux Only label on May 28, 2024
thomashampson (Contributor) commented:

@cailafinn Are you launching with the workbench entry point, or using the mantidworkbench script?

cailafinn (Contributor) commented May 28, 2024

The workbench entry point on both OSes. Always using the version from conda.

rbauststfc (Contributor, Author) commented:

We've noticed that the version of the Nightly from 24th May (which is the first to include the jemalloc pin) is looking much better on IDAaaS now, when started with the launcher from the Applications menu (FYI @sf1919). Tom has explained that launching the Nightly using the IDAaaS launcher sets up jemalloc, whereas using the workbench entry point does not use the script that sets up jemalloc.
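
For anyone checking which situation they are in, here is a minimal Linux-only sketch (it reads /proc/self/maps, and assumes the launch script preloads jemalloc via LD_PRELOAD) that can be pasted into the Workbench script editor to see whether jemalloc is actually loaded in the running process:

import os

def jemalloc_loaded():
    # /proc/self/maps lists every shared object mapped into this process (Linux only).
    with open("/proc/self/maps") as maps:
        return any("jemalloc" in line for line in maps)

print("LD_PRELOAD =", os.environ.get("LD_PRELOAD", "<not set>"))
print("jemalloc mapped into this process:", jemalloc_loaded())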

cailafinn (Contributor) commented:

Re-tested launching using mantidworkbench:

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---------|-------|---------|-------|---------|-------|---------|-------|---------|-------|---------|
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| Nightly (mantidworkbench) | 0.5 | 15.9 | 4.3 | 16.2 | 4.2 | 16.2 | 4.2 | 16.2 | 4.5 | 16.1 |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |

Seems to fix the issue. Do we need to alter the launch instructions for Linux? We currently recommend starting it using the workbench entry point in the Downloads docs.

thomashampson (Contributor) commented May 28, 2024

Some of my findings when loading and clearing all via the GUI (all nightly versions here are 6.9.20240524.1804):
On Ubuntu:

  • v6.9.1 is fine with and without jemalloc (mantidworkbench and workbench)
  • nightly version fills up memory and freezes with and without jemalloc

On IDAaaS:

  • nightly version mantidworkbench (with jemalloc) appears to fill up the memory, but is actually OK because it gets released just in time
  • nightly version via workbench freezes after a few iterations and eventually crashes
  • 6.9.1 seems to be fine in both cases. With jemalloc it seems to give the memory back much quicker than with the nightly. Maybe because we downgraded from 5.3.0 to 5.2.0.

@rbauststfc added the ISIS Team: Core label on May 29, 2024
@sf1919 modified the milestone from Release 6.10 to Release 6.11 on Jun 4, 2024
sf1919 (Contributor) commented Jun 12, 2024

With the further changes to IDAaaS last week, how are things looking, @thomashampson?

@rbauststfc removed the ISIS Team: LSS label on Jun 13, 2024