Azure Batch
Please visit this page to set up your environment to use Azure Batch.
- services:
- azure batch account
- azure storage account
- ?azure container registry: hosting docker images
- ?azure service principal: allows tasks to pull from azure container registry
- ?data factory: could be useful for parameterised running, but we expect to just upload a script with configuration?
- BatchExplorer allows better interaction with pools, jobs, nodes and data storage
Structure of running jobs:
- Pools
- Define VM configuration for a job
- Best practice
- Pools should have more than one compute node for redundancy on failure
- Have jobs use pools dynamically; if moving jobs, move them to the new pool and delete the old pool once complete
- Resize pools to zero every few months
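As a rough sketch of the last point, a pool can be scaled down to zero nodes with the azure-batch Python SDK; the account name, key, URL and pool ID below are placeholders, not values from this project.

import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder account details - replace with the real batch account values
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, "https://mybatchaccount.westeurope.batch.azure.com")

# Scale an idle pool down to zero dedicated and low-priority nodes
batch_client.pool.resize(
    pool_id="tlo-pool",
    pool_resize_parameter=batchmodels.PoolResizeParameter(
        target_dedicated_nodes=0,
        target_low_priority_nodes=0,
    ),
)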
- Applications
- Jobs
- Set of tasks to be run
- Best practice
- 1000 tasks in one job is more efficient than 10 jobs with 100 tasks
- A job has to be explicitly terminated to be marked complete; the onAllTasksComplete property or maxWallClockTime does this
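A sketch of the termination point with the azure-batch Python SDK: create the job without automatic termination, add the tasks, then flip onAllTasksComplete so the job completes itself (job and pool IDs are placeholders; batch_client is an authenticated BatchServiceClient as in the pool sketch above).

from datetime import timedelta
import azure.batch.models as batchmodels

# Create the job with a wall-clock limit as a backstop (placeholder IDs)
job = batchmodels.JobAddParameter(
    id="tlo-scenario-job",
    pool_info=batchmodels.PoolInformation(pool_id="tlo-pool"),
    constraints=batchmodels.JobConstraints(max_wall_clock_time=timedelta(hours=24)),
)
batch_client.job.add(job)

# ... add tasks to the job here ...

# Once all tasks are added, terminate the job automatically when they finish
batch_client.job.patch(
    "tlo-scenario-job",
    batchmodels.JobPatchParameter(
        on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job),
)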
- Tasks
- individual scripts/commands
- Best practice
- task nodes are ephemeral so any data will be lost unless uploaded to storage via OutputFiles
- setting a retention time is a good idea for clarity and for cleaning up data
- Bulk submit collections of up to 100 tasks at a time
- should build in some retry logic to withstand failures
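A sketch tying those points together (the container SAS URL, job ID, script name and retry values are placeholders; batch_client is an authenticated BatchServiceClient as above).

from datetime import timedelta
import azure.batch.models as batchmodels

# Placeholder SAS URL for the blob container that receives task output
output_container_url = "https://mystorage.blob.core.windows.net/outputs?<sas-token>"

tasks = []
for sample in range(100):  # add_collection accepts at most 100 tasks per call
    tasks.append(batchmodels.TaskAddParameter(
        id=f"sample-{sample}",
        command_line=f"/bin/bash -c 'python run_sample.py --sample {sample}'",
        # Upload anything written to ./outputs before the node is reclaimed
        output_files=[batchmodels.OutputFile(
            file_pattern="outputs/**/*",
            destination=batchmodels.OutputFileDestination(
                container=batchmodels.OutputFileBlobContainerDestination(
                    container_url=output_container_url)),
            upload_options=batchmodels.OutputFileUploadOptions(
                upload_condition=batchmodels.OutputFileUploadCondition.task_completion))],
        # Retry on failure, and keep task files on the node for one day only
        constraints=batchmodels.TaskConstraints(
            max_task_retry_count=3,
            retention_time=timedelta(days=1)),
    ))

batch_client.task.add_collection("tlo-scenario-job", tasks)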
- Images
- Custom images with OS
- the storage blob containing the VM?
- conda comes from the linux data science VM
- windows has python 3.7
- linux has python 3.5, but a newer python (for f-string support) could be installed
All of these are defined at the pool level.
- Define start task
- Each compute node runs this command as it joins the pool
- Seems slow and wasteful to run this for each node
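For reference, the start task is part of the pool definition; a minimal sketch, assuming requirements are installed with pip (command, file name and elevation settings are placeholders).

import azure.batch.models as batchmodels

# Runs once on each compute node as it joins the pool
start_task = batchmodels.StartTask(
    command_line="/bin/bash -c 'pip3 install -r requirements.txt'",
    wait_for_success=True,   # do not schedule tasks until this has succeeded
    max_task_retry_count=2,
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            scope=batchmodels.AutoUserScope.pool,
            elevation_level=batchmodels.ElevationLevel.admin)),
)
# Passed as start_task=start_task on the PoolAddParameter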
- Create an application package
- zip file with all dependencies
- can version these and define which version you want to run
- Issue with the default version of Python on azure batch linux
- Seems like a pain to do and redo when updating requirements or applications
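For completeness, this is roughly how a pool would pin an application package version; the application name and version are placeholders.

import azure.batch.models as batchmodels

# Reference a zipped application package (uploaded separately to the Batch
# account) and pin the version the pool's nodes should unpack.
app_refs = [batchmodels.ApplicationPackageReference(
    application_id="tlo-model",   # placeholder application name
    version="0.2.0")]             # placeholder version
# Passed as application_package_references=app_refs on the PoolAddParameter;
# tasks then locate the unpacked files via the AZ_BATCH_APP_PACKAGE_* environment variables.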
- Use a custom image
- limit of 2500 dedicated compute nodes or 1000 low priority nodes in a pool
- can create a VHD and then import it for batch service mode
- use the linux image builder, or Packer directly, to build a linux image for user subscription mode
- Seems like a reasonable option if the framework is stable
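If a custom image were used, the pool's image reference would point at the image resource instead of a marketplace offer; a sketch with a placeholder resource ID, assuming an SDK version that supports virtual_machine_image_id.

import azure.batch.models as batchmodels

# Point the pool at a managed image built with Packer or the image builder
custom_image_ref = batchmodels.ImageReference(
    virtual_machine_image_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Compute/images/tlo-batch-image"))
# Used as image_reference in the pool's VirtualMachineConfiguration,
# together with the matching node_agent_sku_id (e.g. 'batch.node.ubuntu 16.04')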
- Use containers
- can prefetch container images to save on download
- They suggest storing and tagging the image on azure container registry
- Higher cost tier allows for private azure registry
- Can also pull docker images from other repos
- Most flexible option without too much time spent on node setup
- Can use docker images or any OCI images.
- Is there a benefit to Singularity here?
- VM without RDMA
- Publisher: microsoft-azure-batch
- Offer: centos-container
- Offer: ubuntu-server-container
- need to configure the batch pool to run container workloads via ContainerConfiguration settings in the pool's VirtualMachineConfiguration
- prefetch containers - use an Azure container registry in the same region as the pool
image_ref_to_use = batch.models.ImageReference(
    publisher='microsoft-azure-batch',
    offer='ubuntu-server-container',
    sku='16-04-lts',
    version='latest')

"""
Specify container configuration, prefetching the custom container image
('custom_image' is a placeholder for the tagged image name).
"""
container_conf = batch.models.ContainerConfiguration(
    container_image_names=['custom_image'])

new_pool = batch.models.PoolAddParameter(
    id=pool_id,
    virtual_machine_configuration=batch.models.VirtualMachineConfiguration(
        image_reference=image_ref_to_use,
        container_configuration=container_conf,
        node_agent_sku_id='batch.node.ubuntu 16.04'),
    vm_size='STANDARD_D1_V2',
    target_dedicated_nodes=1)
...
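To actually run a task inside the prefetched image, each task carries container settings; a sketch with placeholder image and registry names, assuming the azure-batch SDK's TaskContainerSettings.

import azure.batch.models as batchmodels

# Run the task's command inside the prefetched container image
task = batchmodels.TaskAddParameter(
    id="sample-1",
    command_line="python run_sample.py --sample 1",   # placeholder command
    container_settings=batchmodels.TaskContainerSettings(
        image_name="myregistry.azurecr.io/tlo:latest",  # placeholder image tag
        container_run_options="--rm"),
)
# A private registry is supplied on the pool via
# ContainerConfiguration(container_image_names=[...],
#     container_registries=[batchmodels.ContainerRegistry(
#         registry_server="myregistry.azurecr.io",
#         user_name="<service-principal-id>", password="<secret>")])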
- maybe try Batch Shipyard, which exists for deploying HPC workloads
- nice monitoring, task factory based on parameter sweeps, random or custom python generators
- might be a bit more than we need.
- Python batch examples
- ran the first few examples, straightforward
Running python script in azure
- using the batch explorer tool, can find the data science desktop
- select VM with start task for installing requirements
- use input and output storage blobs
- create an azure data factory pipeline to run the python script on inputs and upload outputs
- Have deployed simple docker project https://github.com/stefpiatek/azure_batch-with_docker
- uses azure container registry for hosting docker images
- uploads multiple scripts and has each node run a script
- then post-processing task run on one node (would be aggregation of runs)
- azure pipelines guide
- Need to have an azure DevOps organisation, need to be an admin of the Azure DevOps project
- Create pipeline using azure-pipelines.yml (DevOps generates one for you)
- Can automatically generate your tag with the commit id
- Or only build when you've explicitly tagged in git
- A scenario is essentially an analysis script, where all of the boilerplate is handled by the parent BaseScenario class
- Draw: a set of parameter values that are overridden in the simulation, drawn from a python-defined distribution
- 100s of draws of override parameters
- draw index: enumeration of the draw set
- Sample: an individual run of a simulation, each simulation being set with a different seed
- 1000s of seeds per draw
Local:
- Create a Scenario class inheriting from BaseScenario, which handles all of the setup and running of the simulation
- Set up configuration (start date, pop size etc.)
- Define registered modules and their configuration
- Define overriding parameters for modules which are drawn from a distribution (a set of these is a draw)
- Can run this script directly using PyCharm at this point
- The if __name__ == '__main__': block sets up and runs the scenario class; population and duration can be adjusted in this block for local running
- Ensure that parameter draws look reasonable
- Using TLO command line interface, create draw configuration files
- Optional: run a single sample from a draw configuration file
- Commit final changes and push branch
Optional on azure:
- Log into a dedicated compute node via ssh (similar to Myriad or any other cluster)
- Pull branch with work on it
- Using tlo command line tool, create draw configuration files
- Run a single sample from a draw configuration file
Local: submit to azure
- Use TLO command line interface to submit job to azure
- python script that contains the scenario class
- branch to use (defaults to master)
- commit id to use (defaults to latest commit)
- scenario seed, e.g. 5
- number of draws, e.g. 100
- samples to run, e.g. 1-1000
- Job successfully submitted message
Using BatchExplorer application
- login and view status of nodes
- Check status of nodes by looking at the stdout.txt file on a node
- Job failure: check status or automated email?
- All samples complete:
- All data pooled together - zip of dataframes for each sample or combining of dataframes?
- Check status on azure batch? email?
- Download pooled data using BatchExplorer to local machine
- Carry out analysis by reading in pickle files
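A minimal sketch of that analysis step, assuming each sample's dataframe is pickled under a draw_*/sample_* folder once downloaded (the layout and file names here are hypothetical, not the project's actual output format).

from pathlib import Path

import pandas as pd

# Hypothetical layout: outputs/enhanced_lifestyle_scenario/draw_*/sample_*/output.pkl
pooled_dir = Path("outputs/enhanced_lifestyle_scenario")

frames = []
for pickle_file in sorted(pooled_dir.glob("draw_*/sample_*/output.pkl")):
    df = pd.read_pickle(pickle_file)
    # Record which draw and sample each row came from
    df["draw"] = pickle_file.parts[-3]
    df["sample"] = pickle_file.parts[-2]
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined.groupby(["draw", "sample"]).size())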
Setup and running of the simulation is all handled within the BaseScenario class.
from numbers import Real
from typing import Dict, List, Type, Union

from tlo import BaseScenario, Date, Module, logging
from tlo.methods import (
    contraception,
    demography,
    enhanced_lifestyle,
    healthseekingbehaviour,
    healthsystem,
    labour,
    pregnancy_supervisor,
    symptommanager,
)


class EnhancedLifestyleScenario(BaseScenario):
    def __init__(self, scenario_seed: int = 1):
        """
        Example scenario setting up all expected custom configuration for its simulations.
        :param scenario_seed: seed for the scenario as a whole
        """
        # initialise base scenario
        super().__init__(scenario_seed)
        # Set custom data
        self.start_date = Date(2010, 1, 1)
        self.end_date = Date(2050, 1, 1)
        self.pop_size = 1_000_000
        self.log_config['custom_levels'] = {
            # Customise the output of specific loggers. They are applied in order:
            "*": logging.CRITICAL,
            "tlo.methods.demography": logging.INFO,
            "tlo.methods.enhanced_lifestyle": logging.INFO
        }

    def draw_parameters(self) -> Dict[Type[Module], Dict[str, Union[Real, List[Real]]]]:
        """
        Creates dictionary that defines overriding parameters for specific modules in the form:
        {tlo.methods.DiseaseModule: {'parameter_1': parameter_value_1, 'parameter_2': parameter_value_2}}.
        Parameters which are not manually set to one value should use self.rng to draw from a distribution.
        :return: dictionary that instructs how to override parameters
        """
        return {
            demography.Demography: {
                'fraction_of_births_male': self.rng.randint(480, 500) / 1000
            },
            contraception.Contraception: {
                'r_init_year': 0.125,
                'r_discont_year': self.rng.exponential(0.1),
            },
        }

    def get_modules(self) -> List[Module]:
        """
        Creates list of modules to be registered in the simulation.
        For the resources at 'TLOmodel/resources' use self.resources
        :return: list of modules to be registered in the simulation.
        """
        # Used to configure health system behaviour
        service_availability = ["*"]
        # list of all modules
        modules = [
            demography.Demography(resourcefilepath=self.resources),
            enhanced_lifestyle.Lifestyle(resourcefilepath=self.resources),
            healthsystem.HealthSystem(resourcefilepath=self.resources,
                                      disable=True,
                                      service_availability=service_availability),
            symptommanager.SymptomManager(resourcefilepath=self.resources),
            healthseekingbehaviour.HealthSeekingBehaviour(resourcefilepath=self.resources),
            contraception.Contraception(resourcefilepath=self.resources),
            labour.Labour(resourcefilepath=self.resources),
            pregnancy_supervisor.PregnancySupervisor(resourcefilepath=self.resources),
        ]
        return modules


if __name__ == '__main__':
    # for running simulation locally only
    scenario = EnhancedLifestyleScenario()
    # For local testing, use shorter time and smaller population
    scenario.end_date = Date(2015, 1, 1)
    scenario.pop_size = 1_000
    # run the scenario
    output = scenario.run()
    print(output)
We create sample metadata
tlo create-samples enhanced_lifestyle_scenario.py --seed 5 --draws 100
Draw configuration files follow the pattern outputs/{scenario file name}/draw_{draw_index}/config.json
Example sample metadata file: enhanced_lifestyle_scenario/draw_3/config.json
{
    "scenario_seed": 5,
    "path": "scripts/enhanced_lifestyle_analyses/enhanced_lifestyle_scenario.py",
    "draw": 3,
    "override_parameters": {
        "tlo.methods.demography.Demography": {
            "fraction_of_births_male": 0.509
        },
        "tlo.methods.contraception.Contraception": {
            "r_init_year": 0.125,
            "r_discont_year": 0.0670234079717601
        }
    }
}
You can then run a specific draw and sample, reading in the metadata json and then running the simulation with this data.
tlo run-samples enhanced_lifestyle_scenario/draw_1/config.json --samples 1-1000
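To make the metadata concrete, the config file can be inspected directly with the standard library; the runner behaviour in the final comment is a hypothetical sketch, not necessarily how the tlo CLI implements it.

import json

# Read the draw metadata written by tlo create-samples (path as in the example above)
with open("enhanced_lifestyle_scenario/draw_3/config.json") as f:
    draw_config = json.load(f)

print(draw_config["scenario_seed"])        # 5
print(draw_config["draw"])                 # 3
print(draw_config["override_parameters"])  # module name -> {parameter: drawn value}
# A runner would instantiate the scenario from draw_config["path"], apply these
# overrides, and run one simulation per requested sample with its own seed.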
After you are happy with a local run, you commit the scenario python file and push this to your branch.
You can log in to a dedicated node in azure, pull your branch, generate samples and run one to make sure that this is working correctly.
When you are ready to run an entire scenario use the tlo CLI:
tlo run-azure-scenario contraception_example.py --seed 70 --draws 100 --samples 1-1000 --branch stef/testing --commit-id 8b71b5be5f293387eb270cffe5a0925b0d97830f
(if no branch is given, master is used; if no commit id, the latest commit is used)
This uses the configuration data in your repository to:
- create a job for the scenario
- on startup
- checkout the correct branch and commit
- run tlo create-samples with the seed and draws
- each node is assigned a task, or a series of tasks (if we want to cap the number of nodes), each running an individual sample identified by the path to its json file
- After all tasks are complete, postprocessing task pools/zips the sample json files and output data frames
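One way that final step could be expressed in Azure Batch is with task dependencies, so the aggregation task only starts once every sample task has finished; a sketch with placeholder IDs and script name, assuming the job was added with uses_task_dependencies=True.

import azure.batch.models as batchmodels

# IDs of the per-sample tasks already added to the job (placeholders)
sample_task_ids = [f"sample-{i}" for i in range(1, 1001)]

# Aggregation task that runs only after every sample task has completed
postprocess = batchmodels.TaskAddParameter(
    id="postprocess",
    command_line="/bin/bash -c 'python combine_outputs.py'",  # placeholder script
    depends_on=batchmodels.TaskDependencies(task_ids=sample_task_ids),
)
# batch_client: authenticated BatchServiceClient as in the earlier sketches
batch_client.task.add(job_id="tlo-scenario-job", task=postprocess)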