
MultiOutputGP emulators run in parallel on GPU #156

Open
ots22 opened this issue Feb 18, 2021 · 2 comments


ots22 commented Feb 18, 2021

Currently the emulators run in serial (through the usual MultiOutputGP Python class), since GaussianProcessGPU can't be pickled for multiprocessing.

Options to fix:

  • Run each output GP serially (for the initial version; a considerable speedup is still expected in many cases) - currently implemented
  • Add pickling/unpickling (see an older version that had this) - done, but see below
  • Handle multiple GPs within library code

One recent multi-emulator example has 60 input points, a prediction batch of 2000 points, and 100000 output emulators.
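The first option (the serial fallback) amounts to looping over the output emulators one at a time. A minimal sketch, assuming a stand-in emulator class rather than the real GaussianProcessGPU API:

```python
# Hypothetical sketch of the serial fallback: MultiOutputGP loops over its
# output emulators instead of dispatching them via multiprocessing.
# FakeGPUEmulator is a stand-in, not the real mogp-emulator class.
import numpy as np

class FakeGPUEmulator:
    """Stand-in for GaussianProcessGPU: predicts the mean of its targets."""
    def __init__(self, targets):
        self.targets = np.asarray(targets)

    def predict(self, testing):
        n = np.asarray(testing).shape[0]
        return np.full(n, self.targets.mean())

def predict_serial(emulators, testing):
    # Serial loop across emulators: each GP still does its linear algebra
    # on the GPU, so a considerable speedup over the CPU version is still
    # expected even without process-level parallelism across emulators.
    return np.stack([em.predict(testing) for em in emulators])

emulators = [FakeGPUEmulator([1.0, 3.0]), FakeGPUEmulator([4.0, 6.0])]
means = predict_serial(emulators, np.zeros((5, 2)))
print(means.shape)  # (2, 5): one row of predictions per output emulator
```

With many output emulators (as in the 100000-emulator example above), the per-emulator Python loop overhead is what the later options try to eliminate.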

@ots22 ots22 added the gpu label Feb 18, 2021
@ots22 ots22 added this to the Merge feature/gpu milestone Feb 18, 2021
nbarlowATI commented Mar 3, 2021

Summary of our thought process during a working session on this:

  • The obstacle we hit before was the pickling/unpickling required by multiprocessing's starmap. Using __setstate__ and __getstate__ from the previous version of GaussianProcessGPU, we can now pickle and unpickle.
  • However, we still see an error when we try to use a MultiOutputGP containing GaussianProcessGPU:

```
terminate called after throwing an instance of 'thrust::system::system_error'
  what(): device free failed : initialization error
```
  • In any case, we think that using starmap for predict will be problematic, since each emulator would need to be refit after unpickling in the new process.
  • We then considered two alternative possibilities:
    • Running the emulators in serial in MultiOutputGP - possibly useful as a quick first step to get feature parity with the CPU version so that the branch can be merged
    • Making a C++ MultiOutputGP - this seems the best approach: a C++ class that mirrors the structure of the Python MultiOutputGP (i.e. owns several DenseGP_GPU objects).
  • We also spent some time investigating whether the destructor of DenseGP_GPU is ever called when the Python object is deleted - it did not appear to be.
    • What to do about this? Write a custom method that mimics a destructor? Write a wrapper class that owns a DenseGP_GPU and has a cleanup() method to delete it? A C++ MultiOutputGP may solve this for us.
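The pickling pattern discussed above, and why starmap-based predict would pay a refit in every worker, can be sketched as follows. This is illustrative only: PicklableGPUEmulator and its `_fit` method are hypothetical stand-ins, not the mogp-emulator implementation.

```python
# Sketch, under assumed names: the GPU handle cannot be pickled, so
# __getstate__ drops it and __setstate__ rebuilds (refits) it in the new
# process - the cost that makes starmap-based predict problematic.
import pickle
import numpy as np

class PicklableGPUEmulator:
    def __init__(self, inputs, targets):
        self.inputs = np.asarray(inputs)
        self.targets = np.asarray(targets)
        self._fit()

    def _fit(self):
        # Stand-in for creating the device-side object and fitting the GP.
        self.refit_count = getattr(self, "refit_count", -1) + 1
        self._device_gp = lambda: None  # un-picklable stand-in for a GPU handle

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_device_gp"]  # drop the GPU handle before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._fit()  # must refit after unpickling in the new process

gp = PicklableGPUEmulator([[0.0], [1.0]], [0.0, 1.0])
gp2 = pickle.loads(pickle.dumps(gp))
print(gp.refit_count, gp2.refit_count)  # 0 1
```

The round trip succeeds, but each unpickle triggers a fresh fit, which is exactly the overhead noted in the bullet about starmap.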

@ots22 ots22 changed the title MultiOutputGP functional on GPU MultiOutputGP emulators run in parallel on GPU Mar 3, 2021
@ots22 ots22 removed this from the Merge feature/gpu milestone Mar 3, 2021

ots22 commented Mar 10, 2021

Some more thoughts:

  • MultiOutputGP provides direct access to the individual emulators (as GaussianProcess objects), and some functionality depends on this (e.g. fitting), which would otherwise have to be duplicated. A MultiOutputGPGPU class should provide the same interface if possible.
  • Adding the ability to construct GaussianProcessGPU objects from DenseGP_GPU objects (via pybind), with minimal construction overhead, would allow the CUDA/C++ code to own the collection of emulators and return DenseGP_GPU objects to Python.
  • This means GaussianProcessGPU should not keep its own copy of inputs/targets (we'd need to rethink pickling).
  • It's unclear how to get this to work well with multiprocessing.
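The wrap-an-existing-device-object idea can be sketched from the Python side. All names here (DeviceGP, GPWrapper, MultiOutputGPGPU) are hypothetical; the point is only the ownership shape: the container owns the device objects (in the real design, on the C++ side), and wrappers hold no copy of inputs/targets.

```python
# Illustrative sketch, not the real mogp-emulator API.
class DeviceGP:
    """Stand-in for the pybind11-wrapped DenseGP_GPU."""
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

class GPWrapper:
    """Lightweight view over an existing device object; constructing it is
    cheap because it stores only a reference, never a data copy."""
    def __init__(self, device_gp):
        self._device_gp = device_gp

    @property
    def targets(self):
        return self._device_gp.targets  # read through to device-owned data

class MultiOutputGPGPU:
    """Owns the collection of device GPs and hands back wrappers,
    mirroring MultiOutputGP's per-emulator access."""
    def __init__(self, inputs, target_lists):
        self._device_gps = [DeviceGP(inputs, t) for t in target_lists]

    def __getitem__(self, i):
        return GPWrapper(self._device_gps[i])

mogp = MultiOutputGPGPU([[0.0], [1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(mogp[0].targets)  # [1.0, 2.0]
```

Because the wrapper stores only a reference, pickling it would no longer capture the data, which is why rethinking pickling (and multiprocessing) comes up in the bullets above.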

Noting @edaub's recent changes in #178
