`minGPT Training <../intermediate/ddp_series_minGPT.html>`__

Multi GPU training with DDP
===========================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
Translated by: `Nathan Kim <https://github.com/NK590>`__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      -  How to migrate a single-GPU training script to multi-GPU via DDP
      -  Setting up the distributed process group
      -  Saving and loading models in a distributed setup

.. grid:: 1

   .. grid-item::

      :octicon:`code-square;1.0em;` View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      *  High-level overview of `how DDP works <ddp_series_theory.html>`__
      *  A machine with multiple GPUs (this tutorial uses an AWS p3.8xlarge instance)
      *  PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA

Follow along with the video below or on `youtube <https://www.youtube.com/watch/-LAtx9Q6DA8>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/-LAtx9Q6DA8" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

In the `previous tutorial <ddp_series_theory.html>`__, we got a high-level overview of how DDP works; now we see how to use DDP in code.
In this tutorial, we start with a single-GPU training script and migrate that to running it on 4 GPUs on a single node.
Along the way, we will talk through important concepts in distributed training while implementing them in our code.

.. note::
   If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
   layers across replicas.

   Use the helper function
   `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) <https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm>`__ to convert all ``BatchNorm`` layers in the model to ``SyncBatchNorm``.

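For illustration, here is a minimal sketch of that conversion; the toy model below is only an assumption used to show the call, and the conversion should happen before wrapping the model in DDP:

.. code-block:: python

    import torch.nn as nn

    # A toy model with BatchNorm layers (illustrative only).
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

    # Replace every BatchNorm layer with SyncBatchNorm so that running stats
    # are synchronized across all replicas. Do this before wrapping with DDP.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
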
Diff for `single_gpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/single_gpu.py>`__ v/s `multigpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__

These are the changes you typically make to a single-GPU training script to enable DDP.

Imports
~~~~~~~
-  ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
   multiprocessing.
-  The distributed process group contains all the processes that can
   communicate and synchronize with each other. A typical import block is sketched after the diff below.

.. code-block:: diff

    + import os

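For reference, a plausible import block for a DDP training script along these lines is sketched below; it is an assumption based on the components discussed in this tutorial, not necessarily the exact header of ``multigpu.py``:

.. code-block:: python

    import os

    import torch
    import torch.multiprocessing as mp                        # PyTorch wrapper around Python multiprocessing
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed import init_process_group, destroy_process_group
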
Constructing the process group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  First, before initializing the process group, call `set_device <https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html?highlight=set_device#torch.cuda.set_device>`__,
   which sets the default GPU for each process. This is important to prevent hangs or excessive memory utilization on `GPU:0`.
-  The process group can be initialized by TCP (default) or from a
   shared file-system. Read more on `process group
   initialization <https://pytorch.org/docs/stable/distributed.html#tcp-initialization>`__.
-  `init_process_group <https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group>`__
   initializes the distributed process group.
-  Read more about `choosing a DDP
   backend <https://pytorch.org/docs/stable/distributed.html#which-backend-to-use>`__. A minimal setup helper is sketched below.

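A minimal setup helper along these lines is sketched below; the function name ``ddp_setup`` and the ``MASTER_ADDR``/``MASTER_PORT`` values are illustrative assumptions for a single-machine run:

.. code-block:: python

    import os

    import torch
    from torch.distributed import init_process_group

    def ddp_setup(rank: int, world_size: int) -> None:
        """
        Args:
            rank: unique identifier of each process
            world_size: total number of processes
        """
        # Rank 0 acts as the coordinator; these values assume a single machine.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        # Bind this process to its own GPU before creating the process group,
        # which avoids hangs or excess memory use on GPU:0.
        torch.cuda.set_device(rank)
        # NCCL is the usual backend choice for GPU training.
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
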
Constructing the DDP model
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: diff

    - self.model = model.to(gpu_id)
    + self.model = DDP(model, device_ids=[gpu_id])

Distributing input data
~~~~~~~~~~~~~~~~~~~~~~~

-  `DistributedSampler <https://pytorch.org/docs/stable/data.html?highlight=distributedsampler#torch.utils.data.distributed.DistributedSampler>`__
   chunks the input data across all distributed processes.
-  Each process will receive an input batch of 32 samples; the effective
   batch size is ``32 * nprocs``, or 128 when using 4 GPUs. A DataLoader sketch follows the diff below.

.. code-block:: diff

    + sampler=DistributedSampler(train_dataset),
      )

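The diff above shows only the added ``sampler`` argument; a fuller, hedged sketch of building the loader is shown below (``prepare_dataloader`` and its defaults are assumptions, not the tutorial's exact code):

.. code-block:: python

    from torch.utils.data import DataLoader, Dataset
    from torch.utils.data.distributed import DistributedSampler

    def prepare_dataloader(dataset: Dataset, batch_size: int = 32) -> DataLoader:
        # Each rank iterates over its own shard of the dataset. Shuffling is
        # delegated to the sampler, so the DataLoader itself must not shuffle.
        return DataLoader(
            dataset,
            batch_size=batch_size,
            pin_memory=True,
            shuffle=False,
            sampler=DistributedSampler(dataset),
        )
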
-  Calling the ``set_epoch()`` method on the ``DistributedSampler`` at the beginning of each epoch is necessary to make shuffling work
   properly across multiple epochs. Otherwise, the same ordering will be used in each epoch. An epoch-loop sketch follows the diff below.

.. code-block:: diff

          self._run_batch(source, targets)

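A hedged sketch of an epoch loop that makes this call; the names ``run_epoch``, ``train_loader``, and ``gpu_id`` are assumptions standing in for the tutorial's ``Trainer`` internals:

.. code-block:: python

    from torch.utils.data import DataLoader

    def run_epoch(epoch: int, train_loader: DataLoader, gpu_id: int) -> None:
        # train_loader is assumed to use a DistributedSampler. Re-seeding it
        # per epoch changes the shuffling order; without this call, every
        # epoch replays the same ordering on every rank.
        train_loader.sampler.set_epoch(epoch)
        for source, targets in train_loader:
            source = source.to(gpu_id)
            targets = targets.to(gpu_id)
            # ... forward pass, loss, backward pass, optimizer step ...
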
Saving model checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~
-  We only need to save model checkpoints from one process. Without this
   condition, each process would save its copy of the identical model. Read
   more on saving and loading models with
   DDP `here <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html#save-and-load-checkpoints>`__.
   A rank-0 save sketch follows the diff and warning below.

.. code-block:: diff

          self._save_checkpoint(epoch)

.. warning::
   `Collective calls <https://pytorch.org/docs/stable/distributed.html#collective-functions>`__ are functions that run on all the distributed processes,
   and they are used to gather certain states or values to a specific process. Collective calls require all ranks to run the collective code.
   In this example, `_save_checkpoint` should not have any collective calls because it is only run on the ``rank:0`` process.
   If you need to make any collective calls, it should be before the ``if self.gpu_id == 0`` check.

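A minimal sketch of a rank-0-only save; the function name and checkpoint path are assumptions, and ``model`` is the DDP-wrapped module:

.. code-block:: python

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def save_checkpoint(model: DDP, gpu_id: int, epoch: int, path: str = "checkpoint.pt") -> None:
        # ``model.module`` unwraps the DDP container so the saved state_dict
        # can later be loaded into a plain, non-DDP model.
        if gpu_id == 0:
            torch.save(model.module.state_dict(), path)
            print(f"Epoch {epoch} | Training checkpoint saved at {path}")
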
Running the distributed training job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Include new arguments ``rank`` (replacing ``device``) and
   ``world_size``.
-  ``rank`` is auto-allocated by DDP when calling
   `mp.spawn <https://pytorch.org/docs/stable/multiprocessing.html#spawning-subprocesses>`__.
-  ``world_size`` is the number of processes across the training job. For GPU training,
   this corresponds to the number of GPUs in use, and each process works on a dedicated GPU. A launch sketch with ``mp.spawn`` is shown below.

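One way to wire these arguments together is sketched below; ``ddp_setup`` refers to the helper sketched earlier, the epoch count is an illustrative value, and the body of ``main`` is elided because it mirrors the single-GPU script:

.. code-block:: python

    import torch
    import torch.multiprocessing as mp
    from torch.distributed import destroy_process_group

    def main(rank: int, world_size: int, total_epochs: int) -> None:
        ddp_setup(rank, world_size)        # construct the process group
        # ... build dataset, model, optimizer and run the training loop ...
        destroy_process_group()            # tear down the process group

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()   # one process per GPU
        total_epochs = 10                        # illustrative value
        # mp.spawn injects the process index (the rank) as the first argument,
        # so only the remaining arguments are passed via ``args``.
        mp.spawn(main, args=(world_size, total_epochs), nprocs=world_size)
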
Further Reading
---------------

-  `Fault Tolerant distributed training <ddp_series_fault_tolerance.html>`__ (next tutorial in this series)
-  `Intro to DDP <ddp_series_theory.html>`__ (previous tutorial in this series)
-  `Getting Started with DDP <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
-  `Process Group
   initialization <https://pytorch.org/docs/stable/distributed.html#tcp-initialization>`__