PyTorch Bolt is
- a simple PyTorch wrapper that makes multi-node multi-GPU training much easier on Slurm

PyTorch Bolt supports
- single-node single-GPU training on a specified GPU device
- multi-node (or single-node) multi-GPU DistributedDataParallel (DDP) training
  - with the torch.distributed.launch module
  - with the Slurm cluster workload manager
.
├── data
│ ├── __init__.py
│ └── customized_datamodule.py
├── model
│ ├── __init__.py
│ └── customized_model.py
├── main.py
├── main.sbatch
└── requirements.txt
MNIST classification using PyTorch Bolt (you might need to go through the relevant tutorials step by step).
Running pip install -r requirements.txt installs all package dependencies.
$ pip install pytorch-bolt
class pytorch_bolt.DataModule(data_dir='data', num_splits=10, batch_size=1, num_workers=0, pin_memory=False, drop_last=False)
Can be called to trigger DistributedSampler when using DistributedDataParallel (DDP).
Returns DataLoader for trainset.
Returns DataLoader for valset.
Returns DataLoader for testset.
Returns argparse parser. (Staticmethod)
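To make the role of DistributedSampler under DDP more concrete, here is a minimal standard-library sketch of how a dataset can be sharded across processes. It imitates the non-shuffled, drop_last=False behaviour of torch.utils.data.DistributedSampler (pad by wrapping around, then take every num_replicas-th index). The function name shard_indices is hypothetical and is not part of PyTorch Bolt; in real training the actual DistributedSampler should be used.

```python
import math


def shard_indices(num_samples, num_replicas, rank):
    """Return the sample indices one process would see (no shuffling).

    Hypothetical helper for illustration only; not a PyTorch Bolt API.
    """
    indices = list(range(num_samples))
    per_replica = math.ceil(num_samples / num_replicas)
    total_size = per_replica * num_replicas
    padding = total_size - len(indices)
    if padding:
        # Wrap around so every process receives the same number of samples.
        repeats = math.ceil(padding / len(indices))
        indices += (indices * repeats)[:padding]
    # Each rank takes a strided slice of the padded index list.
    return indices[rank:total_size:num_replicas]


# 10 samples over 4 processes: each process gets 3 indices.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Because of the padding, a few samples are seen twice per epoch when the dataset size is not divisible by the number of processes.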
Practical template:
import argparse

import pytorch_bolt


class MyDataModule(pytorch_bolt.DataModule):
    def __init__(self, args):
        super().__init__(args)
        # arguments for customized dataset

    # optional helper function can be used
    def _prepare_data(self):
        pass

    def _setup_dataset(self):
        # trainset and valset for fit stage
        # `self.num_splits` can be used for splitting trainset and valset
        # testset for test stage
        return trainset, valset, testset

    @staticmethod
    def add_argparse_args(parent_parser):
        parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
        parser = pytorch_bolt.DataModule.add_argparse_args(parser)
        # TODO
        return parser

    @classmethod
    def from_argparse_args(cls, args):
        return cls(args)
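The add_argparse_args pattern in the template chains parsers through argparse's parents mechanism. The sketch below shows that chaining with the standard library only, so it runs without PyTorch Bolt. The functions add_data_args and add_model_args and the --lr option are hypothetical stand-ins; the data options simply mirror the defaults in the DataModule signature above.

```python
import argparse


def add_data_args(parent_parser):
    # Hypothetical stand-in for a DataModule-style add_argparse_args.
    parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
    parser.add_argument("--data_dir", type=str, default="data")
    parser.add_argument("--num_splits", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=1)
    return parser


def add_model_args(parent_parser):
    # Hypothetical model option, used only for illustration.
    parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
    parser.add_argument("--lr", type=float, default=0.01)
    return parser


def build_parser():
    # Every parser is built with add_help=False so that inherited
    # options never collide on a duplicated -h/--help flag.
    parser = argparse.ArgumentParser(add_help=False)
    parser = add_data_args(parser)
    parser = add_model_args(parser)
    return parser


args = build_parser().parse_args(["--batch_size", "32", "--lr", "0.1"])
print(args.data_dir, args.num_splits, args.batch_size, args.lr)
```

Each component contributes its own options, and the final namespace can be handed to every from_argparse_args-style constructor.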
class pytorch_bolt.Module()
Returns model parameters that have requires_grad=True.
Returns criterion.
Returns metric.
Returns optimizer (and learning rate scheduler).
Practical template:
import argparse

import pytorch_bolt


class MyModel(pytorch_bolt.Module):
    def __init__(self, args):
        super().__init__()
        # hyperparameters for model
        self.model = self._setup_model()
        # hyperparameters for criterion, metric, optimizer and lr_scheduler

    def _setup_model(self):
        # TODO
        return model

    def forward(self, inputs):
        return self.model(inputs)

    # return parameters that have requires_grad=True
    # `parameters_to_update` can be useful for transfer learning
    def parameters_to_update(self):
        return

    # return criterion
    def configure_criterion(self):
        return

    # return metric
    def configure_metric(self):
        return

    # return optimizer (and lr_scheduler)
    def configure_optimizer(self):
        return

    @staticmethod
    def add_argparse_args(parent_parser):
        parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
        # TODO
        return parser

    @classmethod
    def from_argparse_args(cls, args):
        return cls(args)
class pytorch_bolt.Loggers(logs_dir='logs', loggerfmt='%(asctime)s | %(levelname)-5s | %(name)s - %(message)s', datefmt=None, tracker_keys=None (Required), tracker_reduction='mean')
Returns root logger.
Returns root.child logger.
Returns tracker for tracking forward propagation step outputs and statistics.
Returns progress bar for showing forward propagation step progress and details.
Returns TensorBoard writer for visualizing forward propagation epoch outputs.
Returns argparse parser. (Staticmethod)
Loggers constructor.
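As an illustration of what tracker_keys and tracker_reduction='mean' could mean in practice, the following sketch accumulates per-step outputs and reports their mean per key. The class MeanTracker and its method names are hypothetical; this is not the PyTorch Bolt implementation, only one plausible reading of the parameters above.

```python
class MeanTracker:
    """Hypothetical tracker that averages step outputs per key."""

    def __init__(self, tracker_keys):
        self.tracker_keys = list(tracker_keys)
        self.reset()

    def reset(self):
        # Clear running sums and counts, e.g. at the start of an epoch.
        self._sums = {key: 0.0 for key in self.tracker_keys}
        self._counts = {key: 0 for key in self.tracker_keys}

    def update(self, step_outs):
        # Only keys registered up front are tracked; others are ignored.
        for key in self.tracker_keys:
            if key in step_outs:
                self._sums[key] += float(step_outs[key])
                self._counts[key] += 1

    def result(self):
        # Mean reduction over all steps seen since the last reset.
        return {
            key: self._sums[key] / self._counts[key]
            for key in self.tracker_keys
            if self._counts[key] > 0
        }


tracker = MeanTracker(["loss", "score"])
tracker.update({"loss": 1.0, "score": 0.5})
tracker.update({"loss": 0.5, "score": 1.0})
summary = tracker.result()
print(summary)
```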
class pytorch_bolt.Trainer(loggers=None (Required), device=None, distributed=False, use_slurm=False, dist_backend='nccl', master_addr='localhost', master_port='29500', world_size=1, rank=0, local_rank=0, datamodule=None (Required), model=None (Required), max_epochs=5, verbose=False)
Gets rank of current process. (Staticmethod)
Fits the model on trainset, validating each epoch on valset.
Validates trained model by running one epoch on valset.
Tests trained model by running one epoch on testset.
Destroys trainer.
Returns argparse parser. (Staticmethod)
Trainer constructor.
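The world_size, rank and local_rank arguments of the Trainer are related by simple arithmetic when every node runs the same number of processes, which is the convention used by torch.distributed.launch. The sketch below spells that out; the helper names global_rank and total_world_size are hypothetical and exist only for this example.

```python
def total_world_size(nnodes, nproc_per_node):
    # Total number of processes across all nodes.
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    # Processes are numbered node by node, then by their index on the node.
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Two nodes with four GPUs each: the third process on the second node.
print(total_world_size(2, 4), global_rank(1, 4, 2))
```

The process with global rank 0 is conventionally the one that writes logs and checkpoints.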
Practical template for customized trainer:
import pytorch_bolt


class MyTrainer(pytorch_bolt.Trainer):
    def _training_step(self, batch_idx, batch):
        return

    def _training_step_end(self, batch_idx, batch, step_outs):
        return

    # if anything is returned,
    # return a dict containing at least 2 keys: "loss", "score"
    def _training_epoch_end(self):
        return
- Inspired by PyTorch Lightning
| Setting | Slurm environment variable | Description |
| --- | --- | --- |
| WORLD_SIZE | SLURM_NTASKS (and SLURM_NPROCS for backwards compatibility) | Same as -n, --ntasks |
| RANK | SLURM_PROCID | The MPI rank (or relative process ID) of the current process |
| LOCAL_RANK | SLURM_LOCALID | Node local task ID for the process within a job. |
| MASTER_ADDR | SLURM_SUBMIT_HOST | The hostname of the machine from which sbatch was invoked. |
| NPROC_PER_NODE | SLURM_NTASKS_PER_NODE | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified. |
| NNODES | SLURM_JOB_NUM_NODES (and SLURM_NNODES for backwards compatibility) | Total number of nodes in the job's resource allocation. |
| NODE_RANK | SLURM_NODEID | ID of the nodes allocated. |
| | SLURM_JOB_NODELIST (and SLURM_NODELIST for backwards compatibility) | List of nodes allocated to the job. |
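The mapping listed above can be sketched as a small function that reads Slurm variables from an environment mapping and returns the corresponding distributed settings. The function name dist_env_from_slurm is hypothetical and this is not how PyTorch Bolt necessarily implements it. The environment is passed in as a plain dict so the example runs outside a Slurm job.

```python
def dist_env_from_slurm(environ):
    """Translate Slurm variables into distributed settings (illustrative)."""

    def first(*names):
        # Prefer the current variable name, fall back to the legacy one.
        for name in names:
            if name in environ:
                return environ[name]
        raise KeyError("none of these variables are set: " + ", ".join(names))

    settings = {
        "WORLD_SIZE": int(first("SLURM_NTASKS", "SLURM_NPROCS")),
        "RANK": int(environ["SLURM_PROCID"]),
        "LOCAL_RANK": int(environ["SLURM_LOCALID"]),
        "MASTER_ADDR": environ["SLURM_SUBMIT_HOST"],
        "NNODES": int(first("SLURM_JOB_NUM_NODES", "SLURM_NNODES")),
        "NODE_RANK": int(environ["SLURM_NODEID"]),
    }
    # SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node is given.
    if "SLURM_NTASKS_PER_NODE" in environ:
        settings["NPROC_PER_NODE"] = int(environ["SLURM_NTASKS_PER_NODE"])
    return settings


# Made-up values imitating the second task on the second of two nodes.
fake_environ = {
    "SLURM_NTASKS": "8",
    "SLURM_PROCID": "5",
    "SLURM_LOCALID": "1",
    "SLURM_SUBMIT_HOST": "login01",
    "SLURM_JOB_NUM_NODES": "2",
    "SLURM_NODEID": "1",
    "SLURM_NTASKS_PER_NODE": "4",
}
settings = dist_env_from_slurm(fake_environ)
print(settings)
```

Inside a real job, os.environ would be passed instead of the hand-written dict.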