Memory leak when running MCMC in parallel #230

Open

bocklund opened this issue May 3, 2022 · 3 comments

bocklund commented May 3, 2022

Due to a known memory leak when instantiating subclasses of SymEngine Symbol objects (SymEngine is one of our upstream dependencies; see symengine/symengine.py#379), running ESPEI with parallelization will cause memory to grow in each worker.

Only parallel runs trigger significant memory growth, because running in parallel uses the pickle library to serialize and deserialize symbol objects, which creates new objects that can't be freed. When running without parallelization (mcmc.scheduler: null), new symbols are not created.
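
To illustrate the mechanism with a minimal stand-in sketch (FakeSymbol below is hypothetical and not a SymEngine class): unpickling rebuilds an object by calling its constructor in the receiving process, so every symbol a worker deserializes is a fresh allocation there.

import pickle

class FakeSymbol:
    instances_created = 0

    def __new__(cls, name):
        cls.instances_created += 1  # count every allocation, as the leak effectively does
        return super().__new__(cls)

    def __init__(self, name):
        self.name = name

    def __reduce__(self):
        # pickle rebuilds the symbol by calling the constructor again on the other side
        return (FakeSymbol, (self.name,))

x = FakeSymbol("T")
pickle.loads(pickle.dumps(x))        # simulates sending the symbol to a worker and back
print(FakeSymbol.instances_created)  # 2: the round trip allocated a second object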

Until symengine/symengine.py#379 is fixed, some mitigation strategies to avoid running out of memory are:

  • Run ESPEI without parallelization by setting scheduler: null
  • (Under consideration to implement): when parallelization is active, use an option to restart the workers every N iterations.
  • (Under consideration to implement): remove Model objects from the keyword arguments of ESPEI's likelihood functions. Model objects contribute a lot of symbol instances in the form of v.SiteFraction objects. We should be able to get away with only using PhaseRecord objects, but there are a few places that use Model.constituents to infer the sublattice model and internal degrees of freedom, and those would need to be rewritten.

bocklund commented Jul 2, 2022

Code to restart the workers would look something like applying this patch to opt_mcmc.py. This hard-codes the restart interval, but it could be added as a parameter and propagated through:

diff --git a/espei/optimizers/opt_mcmc.py b/espei/optimizers/opt_mcmc.py
index 4d08ccf..c167e9e 100644
--- a/espei/optimizers/opt_mcmc.py
+++ b/espei/optimizers/opt_mcmc.py
@@ -167,12 +167,19 @@ class EmceeOptimizer(OptimizerBase):
     def do_sampling(self, chains, iterations):
         progbar_width = 30
         _log.info('Running MCMC for %s iterations.', iterations)
+        scheduler_restart_interval = 50  # iterations
         try:
             for i, result in enumerate(self.sampler.sample(chains, iterations=iterations)):
                 # progress bar
                 if (i + 1) % self.save_interval == 0:
                     self.save_sampler_state()
                     _log.trace('Acceptance ratios for parameters: %s', self.sampler.acceptance_fraction)
+                if (self.scheduler is not None) and ((i + 1) % scheduler_restart_interval == 0):
+                    # Note: resetting the scheduler will reset the logger settings for the workers
+                    # You'd typically want to run the following, but the verbosity/filename are out of scope
+                    # self.scheduler.run(espei.logger.config_logger, verbosity=log_verbosity, filename=log_filename)
+                    self.scheduler.restart()
+                    _log.info(f"Restarting scheduler (interval = {scheduler_restart_interval} iterations)")
                 n = int((progbar_width) * float(i + 1) / iterations)
                 _log.info("\r[%s%s] (%d of %d)\n", '#' * n, ' ' * (progbar_width - n), i + 1, iterations)
         except KeyboardInterrupt:
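
As a sketch of how the hard-coded interval could instead become a parameter that gets propagated through, something like the following could work (the function and argument names here are illustrative, not ESPEI's actual API):

from typing import Callable, Optional

def run_sampling(iterations: int,
                 sample_step: Callable[[int], None],
                 restart_scheduler: Optional[Callable[[], None]] = None,
                 scheduler_restart_interval: int = 50) -> None:
    # `sample_step` stands in for one emcee sampling iteration and
    # `restart_scheduler` for self.scheduler.restart() in the patch above.
    for i in range(iterations):
        sample_step(i)
        if restart_scheduler is not None and (i + 1) % scheduler_restart_interval == 0:
            restart_scheduler()  # drop and respawn the workers to reclaim leaked memory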


bocklund commented Aug 4, 2024

A potential mitigation is to cache symbols so new instances aren't created:

from symengine import Symbol
from pycalphad.core.cache import lru_cache

class StateVariable(Symbol):
    # Cache instances by constructor arguments so that repeated construction
    # (e.g. during unpickling in dask workers) returns the same object instead
    # of allocating a new, never-freed SymEngine symbol.
    @lru_cache(maxsize=10000)
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls, *args, **kwargs)
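
With a cached __new__ like this, constructing the same symbol twice should hand back the same object, which is what stops deserialization from allocating fresh symbols. A quick check using the sketch above (the symbol name is arbitrary):

a = StateVariable('MYSYMBOL')   # hypothetical symbol name
b = StateVariable('MYSYMBOL')
assert a is b  # the second call hits the cache; no new SymEngine object is allocated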

On recent development branches with parallelization in ESPEI turned off, memory usage appears constant (at least to MB resolution) once the cache is warm. Without this mitigation, by comparison, RAM grows at about 1 MB/s. With parallelization on, memory usage grows either way, but less so with this mitigation applied. Perhaps warming the cache with a run-through of the likelihood function, done in an ESPEI run before forking the process (spinning up the dask workers), would resolve it completely?
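
A hedged sketch of that cache-warming idea (the names likelihood, initial_params, and n_workers are placeholders; this describes the proposal, not current ESPEI behavior):

from dask.distributed import Client, LocalCluster

def warmed_parallel_setup(likelihood, initial_params, n_workers=4):
    # One serial evaluation in the parent process populates the symbol cache...
    likelihood(initial_params)
    # ...before the dask workers are spun up, so that (if the workers inherit the
    # parent's memory) they start with the cache already warm.
    cluster = LocalCluster(n_workers=n_workers, processes=True)
    return Client(cluster)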


bocklund commented Aug 6, 2024

I did some memory benchmarking on this for running ESPEI in parallel. Without the fix, both the parent and child processes grow in memory usage: the main process hit about 8 GB after some number of steps, and the workers hit 4 GB each. After applying the fix above, the parent process still gets up to 8 GB, but the children stay relatively lean at 350 MB and grow much more slowly (after the same number of iterations). Applying another modification, where we run predict before creating the dask local cluster (which forks processes), seemed to make no impact on memory usage compared to only using the symbol cache.

It's unclear why the main process seems to leak a constant amount when run in parallel, but at least there is evidence that this fix meaningfully mitigates the memory leak from symbol creation.
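
The measurements above don't depend on a particular tool; one hedged way to watch parent and worker RSS during a run is with psutil (an assumption, not necessarily what was used here):

import psutil

def report_memory_mb():
    parent = psutil.Process()  # the main ESPEI process
    print(f"parent {parent.pid}: {parent.memory_info().rss / 1e6:.0f} MB")
    for child in parent.children(recursive=True):  # dask nannies and workers
        print(f"  child {child.pid}: {child.memory_info().rss / 1e6:.0f} MB")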
