containerized code: setting and running containerized code (docker, sarus, singularity, conda) #5507

unkcpz · 2022-04-29T13:24:02Z

This is the last part of the implementation of #5250.

In this PR, the containerized code is allowed to setting through cmdline and used in calculation. The containerized engines supported and tested are docker, Sarus and Singularity. It is shown below how to configure the code and running calculation. I will then move the example below into documentation.

Although all basic features are implemented, this PR still needs to be polished but I think it better to get a review before I move on.

cmdline test for new code type.
daemon CI test with a docker that will run a real test on docker (Could you tell me where better to add it? @sphuber).
documentation and include example below.
local code setup should also work with Sarus and Singularity on HPC, but not fully test yet.
conda supported (edited)

Since running code in the container always require mapping the current directory where the input files locate to the working directory in container, the current directory is specified in the job script by $PWD. Therefore, we need to use double quotes to escape the command line parameters. It is set by setting use_double_quotes for computer setup.

in the computer setup must set use_double_quotes to true, since the engine_command escape is controlled by that and $VAR will not be evaluated otherwise

docker

The docker engine support not only set the code installed in the container but also the code in the DB store where uploaded and run in the container where the container provides the needed libraries.

remote code

The typical code setup for it shown below. The option escape_execline is mandatory for running commands in docker container. It will put the cmdline_params and the redirect parameters in the quotes so the whole command is recogonzed and run inside the container.

---
label: add-docker
description: add docker
default_calc_job_plugin: core.arithmetic.add
on_computer: true
computer: localhost
filepath_executable: /bin/bash
image: ubuntu
engine_command: docker run -v $PWD:/workdir:rw -w /workdir {image} sh -c
escape_exec_line: true
prepend_text: ' '
append_text: ' '

Then you can launch the calculation with script:

from aiida import orm
from aiida.engine import run_get_node
from aiida.plugins import CalculationFactory

ArithmeticAddCalculation = CalculationFactory('core.arithmetic.add') 

inputs = {
    'code': orm.load_code('docker-bash-add@localhost'),
    'x': orm.Int(4),
    'y': orm.Int(6),
    'metadata': {
        # 'dry_run': True,
        'options': {
            'resources': {
                'num_machines': 1,
                'num_mpiprocs_per_machine': 1
            }
        }
    }
}

_, node = run_get_node(ArithmeticAddCalculation, **inputs)

local code (store in db)

The code can be set up by specifying the executable file in the local machine and uploaded to what ever the computers set in the aiida dababase. This is very useful for example when you have a python script that has special dependencies. You can create the image that contains the libraries and then able to running code on all kinds of machine only with docker installed.

The code setup config example is:

label: "docker-python-add"
description: "doing python add"
input_plugin: "core.arithmetic.add"
on_container: true
on_computer: false
image: "python:3.9.12-buster"
engine_command: "docker run -v $PWD:/workdir:rw -w /workdir {image} sh -c"
code_folder: "/home/jyu/Projects/WP-aiida/docker_python_demo/"
code_rel_path: "eval_sh.py"
use_double_quotes: true
prepend_text: " "
append_text: " "

where the executable eval_sh.py is a dummy python script that execute bash < aiida.in > aiida.out only for demo purpose.

#!/usr/bin/env python
import os

if __name__ == '__main__':
    inputfile = 'aiida.in'
    outputfile = 'aiida.out'
    
    with open(inputfile, 'r') as f:
        bashCommand = '/bin/bash < aiida.in > aiida.out'
        os.system(bashCommand)

Running the code by:

from click import Argument
from aiida import orm
from aiida.engine import run_get_node
from aiida.plugins import CalculationFactory

ArithmeticAddCalculation = CalculationFactory('core.arithmetic.add') 

inputs = {
    'code': orm.load_code('docker-python-add'),
    'x': orm.Int(4),
    'y': orm.Int(6),
    'metadata': {
        # 'dry_run': True,
        'computer': orm.load_computer('localhost'),
        'options': {
            'resources': {
                'num_machines': 1,
                'num_mpiprocs_per_machine': 1
            }
        }
    }
}

_, node = run_get_node(ArithmeticAddCalculation, **inputs)

Just specify the computer and no need to worry about the dependencies.

Sarus and Singularity

The Sarus and Singularity share the same logic the only difference comes from the details of running containerized code which can be specified by engine_command when setting the code.

I create a image jusong/qe-mpich314:v01 with q-e 6.8 compiled with MPICH and able to run pw calculation in the container with full parallelization capability.

code setup config files

Sarus

label: "sarus-pw-7.0"
description: "running pw.x in sarus containerized"
input_plugin: "quantumespresso.pw"
on_container: true
on_computer: true
image: "containers4hpc/qe-mpich314:0.1.0"
engine_command: "sarus run --mount=src=$PWD,dst=/workdir,type=bind --workdir=/rundir {image}"
escape_exec_line: False
filepath_executable: "/usr/local/bin/pw.x"
computer: "daint-mc-mr0"
use_double_quotes: true
prepend_text: " "
append_text: " "

For singularity the image is more than a image name but the path of the sif image file. In fact the Sarus also have image download and stored but on specific directory therefore only image name needed.

---
label: "singularity-pw-7.0"
description: "pw.x in singularity container"
default_calc_job_plugin: "quantumespresso.pw"
on_container: true
on_computer: true
image: "/tmp/singtest/qe-mpich314_0.1.0.sif"
engine_command: "singularity exec --bind $PWD:$PWD {image}"
escape_exec_line: False
inner_mpi: False
filepath_executable: "/usr/local/bin/pw.x"
computer: "localhost"
use_double_quotes: true
prepend_text: " "
append_text: " "

conda

First, you need have a conda environment with the executable and mpi installed.
Here I'll show a example of using conda to run pw.x calculation from Quantum ESPRESSO.
Create a new conda environment with conda create -n container-run and install the Quantum ESPRESSO from conda forge conda install -c conda-forge qe.
The executables of QE can be found from <ENV_PATH>/bin.
Then configure the code with the following config yaml file. Notice that for the conda container environment, it is similar to docker in that the MPI command should run from inside the container rather than mapping to the host MPI libraries, so the inner_mpi set to True.
The stdin and stdout is also redirect input and output fully from inside the container (env) which needs to be called by bash -c and inside the single quotes with escape_exec_line set to True.

code config:

---
label: "conda-pw-7.0"
description: "pw.x in canda container"
default_calc_job_plugin: "quantumespresso.pw"
on_container: true
on_computer: true
image: "container-run"
engine_command: "conda run --name {image} bash -c"
escape_exec_line: true
inner_mpi: true
filepath_executable: "/home/jyu/miniconda3/envs/container-run/bin/pw.x"
computer: "localhost"
use_double_quotes: false
prepend_text: " "
append_text: " "

Launching the calculation with the code set.

The typical inputs for the process are all the same as regular code, only need to make sure the special MPI setting is specified for the image.

from aiida import orm, plugins
from aiida.engine import submit

code = orm.load_code('<containerizedo_code_name>@<computer>')
builder = code.get_builder()

structure = orm.load_node(<structure_data_node>)
builder.structure = structure
pseudo_family = orm.load_group('SSSP/1.1/PBE/efficiency')
pseudos = pseudo_family.get_pseudos(structure=structure)
builder.pseudos = pseudos
parameters = {
  'CONTROL': {
    'calculation': 'scf',  # self-consistent field
  },
  'SYSTEM': {
    'ecutwfc': 30.,  # wave function cutoff in Ry
    'ecutrho': 240.,  # density cutoff in Ry
  },
}
builder.parameters = orm.Dict(dict=parameters)

KpointsData = plugins.DataFactory('array.kpoints')
kpoints = KpointsData()
kpoints.set_kpoints_mesh([4,4,4])
builder.kpoints = kpoints

builder.metadata.options.resources = {'num_machines': 1, 'num_mpiprocs_per_machine': 2}

calcjob_node = submit(builder)

unkcpz · 2022-04-29T13:26:50Z

@ltalirz It would be nice if you can running a more complex docker local code test example. Unfortunately, I don't have the use case and plugins for this type of code.

sphuber · 2022-05-02T21:48:53Z

Thanks @unkcpz . I think having support for containerized codes is great, and so we should push this through. However, I think the negative consequences of the original design of the Code plugin are getting exacerbated by adding yet another subclass. The CLI is becoming very complex with all the various options and branches to follow. I think a redesign of Code would significantly simplify all of this.

I have been wanting to do that for a long time, but was always hesitant given that it is such a fundamental part of aiida-core that we cannot break. Now I have taken the leap and made an attempt to improve things. See #5509 for a description of the problem and #5510 for the first implementation. I think that this new approach should make adding your ContainerizedCode a breeze. Maybe we can meet sometime to go through it and see how your implementation would look after my refactoring, to see if there are further changes that are needed.

unkcpz · 2022-05-02T22:30:07Z

@sphuber thanks for the quick refactoring of Code class. I agree it is the right direction for this PR, have to say I struggle a lot from the confusing local/remote code. I will have a look at #5510 to see if it makes this one simpler.

ltalirz · 2022-05-04T01:01:13Z

Thanks a lot for getting this PR ready @unkcpz and sorry for the late reply - is there any particular docker feature you would like me to test?

Comments from my side:

I personally don't care about the local codes. We once did a survey on the AiiDA mailing list about who used these and nobody replied. I never encountered someone who used them. In my view they create more confusion than benefit and we should drop support for them (and you don't need to implement support for this feature in containerized codes)

As to where to run a docker test, I would suggest to add it as a "system tests". See e.g.

aiida-core/.github/system_tests/pytest/test_memory_leaks.py

Lines 48 to 62 in 80d1125

    
           def test_leak_ssh_calcjob(): 
        
               """Test whether running a CalcJob over SSH leaks memory. 
        
               Note: This relies on the 'slurm-ssh' computer being set up. 
        
               """ 
        
               code = orm.Code( 
        
                   input_plugin_name='core.arithmetic.add', remote_computer_exec=[orm.load_computer('slurm-ssh'), '/bin/bash'] 
        
               ) 
        
               inputs = {'x': orm.Int(1), 'y': orm.Int(2), 'code': code} 
        
               run_finished_ok(ArithmeticAddCalculation, **inputs) 
        
               # check that no reference to the process is left in memory 
        
               # some delay is necessary in order to allow for all callbacks to finish 
        
               process_instances = get_instances(processes.Process, delay=0.2) 
        
               assert not process_instances, f'Memory leak: process instances remain in memory: {process_instances}'

for a test that runs a job on a slurm docker container created by github actions (you would not need that container, your test would create its own container)

unkcpz · 2022-05-04T06:57:29Z

I would suggest to add it as a "system tests".

@ltalirz thanks for the suggestion.

Me and Seb will meet tomorrow to settle the issues and I'll then try to finish this PR, maybe without implementing local code for containerized code, in order to keep the first introduction of containerized code concept simple and useful.

unkcpz · 2022-05-24T16:05:29Z

@sphuber I reimplement this with new code, it really simplifies the implementation a lot. But there are still some open questions about the actual use of this containerized code.

Although I start this task with lots of tests on SARUS and Singularity, after using Singularity also on my local machine and some other deployments of Singularity, I find it is not easy to uniform all kinds of engine commands. But the good thing here is in the current implementation, user has all flexibility to set engine command.
The executable of docker, sarus, singularity hard to be validate as filepath executable of Installed Code using verdi code test. Since the like modules load information is not easy to separate from prepned_text. And even the container executable itself is in the engine command. I can add an extra option for it but I think it will increase the burden to code user.
It takes time to download image first time in docker run. And it will fail for Singularity and Sarus if image is not set which require user to fetch it manually. There is solution that we can add a command to do this in remote machine since the commands to fetch image are all similar.

About this PR, if @sphuber good with current code structure, I'll go ahead with add CI test and documentation. And running some production run on my wrapped up pseudopotential generator code in container.

Also pinning @giovannipizzi for comment.

unkcpz · 2022-05-24T16:10:44Z

Another thought about the verdi code create, it accepts the entry point of codes, core.code.installed and core.code.portable. Is that much simpler to hide the duplicate information of core.code since it must be the code?
I can image one reason that for other code type implementable not from aiida core this will be used.

sphuber

Thanks @unkcpz . Code is looking good, but have some minor changes/simplifications. It would indeed be good to have some examples of how to setup and run these with docker, singularity and sarus.

I think the verdi code test question can be left for later. This was only recently added and only really does something for InstalledCode. Don't think we need it for the containerized codes now, anyway they are a new experimental feature that will require some testing.

sphuber · 2022-05-25T06:46:16Z

aiida/engine/daemon/execmanager.py

+                    try:
+                        handle.write(code.base.repository.get_object_content(filename, mode='rb'))
+                    except:
+                        # raise TypeError('directory not supperted.')
+                        pass


Instead of doing this, I think it would maybe be nicer to use:

from aiida.repository import FileType for obj in filter(code.base.repository.list_objects(), lambda o: o.file_type == FileType.FILE): with NamedTemporaryFile(mode='wb+') as handle: handle.write(code.base.repository.get_object_content(filename, mode='rb'))

Actually, thinking about this, this code is wrong. It only copies top-level files, but doesn't recurse into directories. Probably there is no test that actually tests this case. We should actually use code.base.repository.walk and iterate over all the files and copy those. Maybe I will quickly fix this in a separate PR.

Yes, I put the pass here for a further look into it and forget about it. I think what I expected here is as you said to walk through the directory and copy all from inside. This will then also support the that filepath_executable of portable code is a real relative path of code inside a subfolder.

sphuber · 2022-05-25T06:47:31Z

aiida/engine/daemon/execmanager.py

@@ -174,14 +175,18 @@ def upload_calculation(
    # Still, beware! The code file itself could be overwritten...
    # But I checked for this earlier.
    for code in input_codes:
-        if isinstance(code, PortableCode):
+        if isinstance(code, (PortableCode, PortableContainerizedCode)):


Suggested change

if isinstance(code, (PortableCode, PortableContainerizedCode)):

if isinstance(code, PortableCode):

Since PortableContainerizedCode is actually a subclass, you don't have to specifically add it.

sphuber · 2022-05-25T06:47:44Z

aiida/engine/processes/calcjobs/calcjob.py

@@ -611,7 +620,8 @@ def presubmit(self, folder: Folder) -> CalcInfo:
                    )
                )

-            if isinstance(code, PortableCode) and str(code.filepath_executable) in folder.get_content_list():
+            if isinstance(code, (PortableCode, PortableContainerizedCode)) and str(code.filepath_executable


Suggested change

if isinstance(code, (PortableCode, PortableContainerizedCode)) and str(code.filepath_executable

if isinstance(code, PortableCode) and str(code.filepath_executable

sphuber · 2022-05-25T06:48:08Z

aiida/engine/processes/calcjobs/calcjob.py

+            this_code = load_node(
+                code_info.code_uuid,
+                sub_classes=(Code, InstalledCode, PortableCode, InstalledContainerizedCode, PortableContainerizedCode)
+            )


Suggested change

this_code = load_node(

code_info.code_uuid,

sub_classes=(Code, InstalledCode, PortableCode, InstalledContainerizedCode, PortableContainerizedCode)

)

this_code = load_code(code_info.code_uuid)

sphuber · 2022-05-25T06:48:37Z

aiida/engine/processes/calcjobs/calcjob.py

@@ -715,10 +728,20 @@ def presubmit(self, folder: Folder) -> CalcInfo:
            else:
                prepend_cmdline_params = []

+            escape_exec_line = False
+            if isinstance(this_code, (InstalledContainerizedCode, PortableContainerizedCode)):


Suggested change

if isinstance(this_code, (InstalledContainerizedCode, PortableContainerizedCode)):

if isinstance(this_code, ContainerizedCode):

Since they share this base class, why not use that?

sphuber · 2022-05-25T07:10:01Z