Commit 63f8eee

fix doc newdistrib
1 parent 14f8202 commit 63f8eee

17 files changed (82 additions, 167 deletions)

README.md

Lines changed: 5 additions & 2 deletions
@@ -96,8 +96,11 @@ python train.py -data data/demo -save_model demo-model
 
 The main train command is quite simple. Minimally it takes a data file
 and a save file. This will run the default model, which consists of a
-2-layer LSTM with 500 hidden units on both the encoder/decoder. You
-can also add `-gpuid 1` to use (say) GPU 1.
+2-layer LSTM with 500 hidden units on both the encoder/decoder.
+If you want to train on a GPU, you need to set, for example:
+CUDA_VISIBLE_DEVICES=1,3 and
+`-world_size 2 -gpu_ranks 0 1` to use (say) GPUs 1 and 3 on this node only.
+To learn more about distributed training on single or multiple nodes, read the FAQ section.
 
 ### Step 3: Translate
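For reference, the GPU instructions added in this hunk boil down to a command like the following (a sketch only: it assumes the demo data from the quickstart and that the OS GPUs 1 and 3 are the ones to use):

```bash
# Expose OS GPUs 1 and 3 to the process; inside the job they become ranks 0 and 1.
export CUDA_VISIBLE_DEVICES=1,3
# Train the default 2-layer LSTM on both visible GPUs of this single node.
python train.py -data data/demo -save_model demo-model \
    -world_size 2 -gpu_ranks 0 1
```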

data/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@
 
 > python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/data -src_vocab_size 1000 -tgt_vocab_size 1000
 
-> python train.py -data data/data -save_model /n/rush_lab/data/tmp_ -gpuid 0 -rnn_size 100 -word_vec_size 50 -layers 1 -train_steps 100 -optim adam -learning_rate 0.001
+> python train.py -data data/data -save_model /n/rush_lab/data/tmp_ -world_size 1 -gpu_ranks 0 -rnn_size 100 -word_vec_size 50 -layers 1 -train_steps 100 -optim adam -learning_rate 0.001

docs/source/FAQ.md

Lines changed: 13 additions & 11 deletions
@@ -62,21 +62,22 @@ python train.py -save_model data/model \
 ```
 
 
-## How do I use the Transformer model?
+## How do I use the Transformer model? Do you support multi-gpu?
 
 The transformer model is very sensitive to hyperparameters. To run it
 effectively you need to set a bunch of different options that mimic the Google
 setup. We have confirmed the following command can replicate their WMT results.
 
 ```
-python train.py -data /tmp/de2/data -save_model /tmp/extra -gpuid 1 \
+python train.py -data /tmp/de2/data -save_model /tmp/extra \
 -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
 -encoder_type transformer -decoder_type transformer -position_encoding \
 -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
 -max_grad_norm 0 -param_init 0 -param_init_glorot \
--label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -gpuid 0 1 2 3
+-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
+-world_size 4 -gpu_ranks 0 1 2 3
 ```
 
 Here are what each of the parameters mean:
@@ -87,16 +88,17 @@ Here are what each of the parameters mean:
 * `batch_type tokens`, `normalization tokens`, `accum_count 4`: batch and normalize based on number of tokens and not sentences. Compute gradients based on four batches.
 - `label_smoothing 0.1`: use label smoothing loss.
 
-* `gpuid 0 1 2 3 accum_count 2`: This will use 4 GPU and accumulate over 2 batches before updating parameters, this will emulate using 8 GPUS.
-
-
-## Do you support multi-gpu?
-
-Yes !
+Multi-GPU settings:
 First you need to make sure you export CUDA_VISIBLE_DEVICES=0,1,2,3
-Then use -gpuid 0 1 2 3
 If you want to use GPU id 1 and 3 of your OS, you will need to export CUDA_VISIBLE_DEVICES=1,3
-then use -gpuid 0 1
+* `world_size 4 gpu_ranks 0 1 2 3`: This will use 4 GPUs on this node only.
+
+If you want to use 2 nodes with 2 GPUs each, you need to set -master_ip and -master_port, and
+* `world_size 4 gpu_ranks 0 1`: on the first node
+* `world_size 4 gpu_ranks 2 3`: on the second node
+* `accum_count 2`: This will accumulate over 2 batches before updating parameters.
+
+If you use a regular network card (1 Gbps), we suggest using a higher accum_count to minimize inter-node communication.
 
 ## How can I ensemble Models at inference?
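For illustration, the two-node setup described in the updated FAQ text could look roughly like the commands below. This is a sketch only: 10.0.0.1 and 10000 are placeholder values for the master node's address and port, each node is assumed to expose two GPUs, and the Transformer model options from the command above are omitted for brevity.

```bash
# Node 1: hosts the master process and runs ranks 0 and 1 on its two GPUs.
export CUDA_VISIBLE_DEVICES=0,1
python train.py -data /tmp/de2/data -save_model /tmp/extra \
    -world_size 4 -gpu_ranks 0 1 -accum_count 2 \
    -master_ip 10.0.0.1 -master_port 10000

# Node 2: runs ranks 2 and 3, pointing at the same master address and port.
export CUDA_VISIBLE_DEVICES=0,1
python train.py -data /tmp/de2/data -save_model /tmp/extra \
    -world_size 4 -gpu_ranks 2 3 -accum_count 2 \
    -master_ip 10.0.0.1 -master_port 10000
```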

docs/source/Summarization.md

Lines changed: 4 additions & 2 deletions
@@ -94,7 +94,8 @@ python train.py -save_model models/cnndm \
 -copy_loss_by_seqlength \
 -bridge \
 -seed 777 \
--gpuid X
+-world_size 2 \
+-gpu_ranks 0 1
 ```
 
 (2) CNNDM Transformer
@@ -129,7 +130,8 @@ python -u train.py -data data/cnndm/CNNDM \
 -share_embeddings \
 -copy_attn \
 -param_init_glorot \
--gpuid 3
+-world_size 2 \
+-gpu_ranks 0 1
 ```
 
 (3) Gigaword

docs/source/extended.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/mult
 Step 2. Train the model.
 
 ```bash
-python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpuid 0
+python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpu_ranks 0
 ```
 
 Step 3. Translate sentences.

docs/source/im2text.md

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ python preprocess.py -data_type img -src_dir data/im2text/images/ -train_src dat
 2) Train the model.
 
 ```
-python train.py -model_type img -data data/im2text/demo -save_model demo-model -gpuid 0 -batch_size 20 \
+python train.py -model_type img -data data/im2text/demo -save_model demo-model -gpu_ranks 0 -batch_size 20 \
 -max_grad_norm 20 -learning_rate 0.1 -word_vec_size 80 -encoder_type brnn
 ```

docs/source/options/train.md

Lines changed: 20 additions & 11 deletions
@@ -116,15 +116,18 @@ Path prefix to the ".train.pt" and ".valid.pt" file path from preprocess.py
 Model filename (the model will be saved as <save_model>_epochN_PPL.pt where PPL
 is the validation perplexity
 
-* **-gpuid []**
-Use CUDA on the listed devices.
+* **-world_size [1]**
+Total number of GPU processes across several nodes.
+
+* **-gpu_ranks []**
+Indices of this node's processes in the total number of processes across several nodes.
 
 * **-seed [-1]**
 Random seed used for the experiments reproducibility.
 
 ### **Initialization**:
-* **-start_epoch [1]**
-The epoch from which to start
+* **-train_steps [100000]**
+Number of iterations (parameter updates) for training
 
 * **-param_init [0.1]**
 Parameters are initialized over uniform distribution with support (-param_init,
@@ -169,13 +172,13 @@ batch_size * accum_count batches at once. Recommended for Transformer.
 * **-valid_batch_size [32]**
 Maximum batch size for validation
 
+* **-valid_steps [10000]**
+Run a validation every this many steps
+
 * **-max_generator_batches [32]**
 Maximum batches of words in a sequence to run the generator on in parallel.
 Higher is faster, but uses more memory.
 
-* **-epochs [13]**
-Number of training epochs
-
 * **-optim [sgd]**
 Optimization method.
 
@@ -222,11 +225,17 @@ Starting learning rate. Recommended settings: sgd = 1, adagrad = 0.1, adadelta =
 If update_learning_rate, decay learning rate by this much if (i) perplexity does
 not decrease on the validation set or (ii) epoch has gone past start_decay_at
 
-* **-start_decay_at [8]**
-Start decaying every epoch after and including this epoch
+* **-start_decay_steps [50000]**
+Start decaying the learning rate after this many steps
+
+* **-decay_steps [10000]**
+Decay every this many steps (after start_decay_steps)
+
+* **-save_checkpoint_steps [5000]**
+Save a checkpoint every this many steps
 
-* **-start_checkpoint_at []**
-Start checkpointing every epoch after and including this epoch
+* **-keep_checkpoint [-1]**
+Keep the N last checkpoints. -1 = keep all.
 
 * **-decay_method []**
 Use a custom decay rate.
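For reference, the step-based options documented above could be combined in a single command like this (a sketch using the documented default values and the demo data paths from the README, not a tuning recommendation):

```bash
# All step counts below are the documented defaults; -keep_checkpoint -1 keeps every checkpoint.
python train.py -data data/demo -save_model demo-model \
    -world_size 1 -gpu_ranks 0 \
    -train_steps 100000 -valid_steps 10000 \
    -start_decay_steps 50000 -decay_steps 10000 \
    -save_checkpoint_steps 5000 -keep_checkpoint -1
```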

docs/source/quickstart.md

Lines changed: 5 additions & 2 deletions
@@ -35,8 +35,11 @@ python train.py -data data/demo -save_model demo-model
 
 The main train command is quite simple. Minimally it takes a data file
 and a save file. This will run the default model, which consists of a
-2-layer LSTM with 500 hidden units on both the encoder/decoder. You
-can also add `-gpuid 1` to use (say) GPU 1.
+2-layer LSTM with 500 hidden units on both the encoder/decoder.
+If you want to train on a GPU, you need to set, for example:
+CUDA_VISIBLE_DEVICES=1,3 and
+`-world_size 2 -gpu_ranks 0 1` to use (say) GPUs 1 and 3 on this node only.
+To learn more about distributed training on single or multiple nodes, read the FAQ section.
 
 ### Step 3: Translate

docs/source/speech2text.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ python preprocess.py -data_type audio -src_dir data/speech/an4_dataset -train_sr
 2) Train the model.
 
 ```
-python train.py -model_type audio -data data/speech/demo -save_model demo-model -gpuid 0 -batch_size 16 -max_grad_norm 20 -learning_rate 0.1 -learning_rate_decay 0.98 -train_steps 100000
+python train.py -model_type audio -data data/speech/demo -save_model demo-model -gpu_ranks 0 -batch_size 16 -max_grad_norm 20 -learning_rate 0.1 -learning_rate_decay 0.98 -train_steps 100000
 ```
 
 3) Translate the speechs.

onmt/inputters/inputter.py

Lines changed: 1 addition & 2 deletions
@@ -488,8 +488,7 @@ def batch_size_fn(new, count, sofar):
             return max(src_elements, tgt_elements)
     else:
         batch_size_fn = None
-    # device = opt.device_id if opt.gpuid else -1
-    # breaking change torchtext 0.3
+
     if opt.gpu_ranks:
         device = "cuda"
     else: