Parallel GPU training #6

Open

@james-daily

Description

Would it be possible to support parallel GPU training? We would like to try this with a fairly large (multi-GB) dataset, but to make training time reasonable it would need to be done in parallel. Single-node parallelism with DataParallel() would probably work for our use case, although the PyTorch documentation suggests that DistributedDataParallel() is preferred even for a single node.
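
For reference, here is a minimal sketch of what single-node DistributedDataParallel() training typically looks like in plain PyTorch. The model and dataset below are placeholders standing in for this project's real ones, not part of its API:

```python
# Minimal single-node DistributedDataParallel sketch.
# Assumes PyTorch with CUDA; the Linear model and synthetic
# TensorDataset are placeholders for the real ones.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank: int, world_size: int):
    # One process per GPU; NCCL is the recommended backend for CUDA.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model and synthetic data; swap in the real ones.
    model = torch.nn.Linear(128, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))

    # DistributedSampler shards the dataset so each rank sees a distinct slice.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```

The main additions over a single-GPU loop are one process per GPU, a DistributedSampler so each rank trains on a distinct shard of the data, and the DDP wrapper that all-reduces gradients during backward().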

Part of the motivation for this is that a large dataset needs a lot of memory, which in a cloud environment means a large, multi-GPU instance. It is very expensive to run such a large instance for weeks with all but one of the GPUs idle.

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
