@@ -4,34 +4,90 @@ ResNet-N with TensorFlow and DALI
This demo implements a residual network model and uses DALI for the data
augmentation pipeline from `the original paper`_.

- Common utilities for defining the network and performing basic training
- are located in the nvutils directory. Use of nvutils is demonstrated in
- the model scripts available in :fileref:`docs/examples/use_cases/tensorflow/resnet-n/resnet.py`.
-
- For parallelization, we use the Horovod distribution framework, which
- works in concert with MPI. To train ResNet-50 (``--layers=50``) using 8
- V100 GPUs, for example on DGX-1, use the following command
- (``--dali_cpu`` indicates to the script to use CPU backend for DALI):
-
- ::
-
-     $ mpiexec --allow-run-as-root --bind-to socket -np 8 python resnet.py \
-         --layers=50 \
-         --data_dir=/data/imagenet \
-         --data_idx_dir=/data/imagenet-idx \
-         --precision=fp16 \
-         --log_dir=/output/resnet50 \
-         --dali_cpu
-
- Here we have assumed that imagenet is stored in tfrecord format in the
- directory '/data/imagenet'. After training completes, evaluation is
- performed using the validation dataset.
-
- Some common training parameters can tweaked from the command line.
- Others must be configured within the network scripts themselves.
-
- Original scripts modified from ``nvidia-examples`` scripts in `NGC
- TensorFlow Container`_
+ It implements the ResNet50 v1.5 CNN model and demonstrates efficient
+ single-node training on multi-GPU systems. The scripts can be used for
+ benchmarking, or as a starting point for implementing and training your
+ own network.
+
+ Common utilities for defining CNN networks and performing basic training
+ are located in the nvutils directory. The utilities are written in
+ TensorFlow 2. Use of nvutils is demonstrated in the model scripts
+ (resnet.py and resnet_ctl.py), which support both the Keras Fit/Compile
+ and Custom Training Loop (CTL) modes with Horovod.
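+
+ Typically, both modes start from the same Horovod setup. As a point of
+ reference, here is a minimal sketch of the usual TensorFlow 2 + Horovod
+ initialization (illustrative, not the example's actual code)::
+
+     import horovod.tensorflow as hvd
+     import tensorflow as tf
+
+     # One process is launched per GPU by mpiexec; initialize Horovod first.
+     hvd.init()
+
+     # Pin each process to a single, distinct GPU.
+     gpus = tf.config.list_physical_devices("GPU")
+     if gpus:
+         tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
+         tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")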
+
+ To use the DALI pipeline for data loading and preprocessing, pass one of::
+
+     --dali_mode=GPU
+     --dali_mode=CPU
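+
+ To illustrate the difference between the two modes, here is a minimal,
+ hypothetical DALI pipeline (not the one shipped with the example): with
+ ``CPU``, images are decoded and augmented on the host, while ``GPU`` uses
+ the "mixed" decoder and keeps the heavy work on the device::
+
+     import nvidia.dali.fn as fn
+     import nvidia.dali.types as types
+     from nvidia.dali import pipeline_def
+
+     @pipeline_def
+     def resnet_pipeline(file_root, use_gpu):
+         # Read (encoded image, label) pairs from disk.
+         jpegs, labels = fn.readers.file(file_root=file_root,
+                                         random_shuffle=True)
+         # "mixed" decodes on the GPU, "cpu" on the host.
+         images = fn.decoders.image(jpegs,
+                                    device="mixed" if use_gpu else "cpu",
+                                    output_type=types.RGB)
+         images = fn.random_resized_crop(images, size=(224, 224))
+         # Random horizontal flip, then normalize to CHW float output.
+         images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,
+                                           output_layout="CHW",
+                                           mirror=fn.random.coin_flip())
+         return images, labels
+
+     pipe = resnet_pipeline(file_root="/data/images", use_gpu=True,
+                            batch_size=64, num_threads=4, device_id=0)
+     pipe.build()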
+
+ Training in Keras Fit/Compile mode
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ For the full training on 8 GPUs::
+
+     mpiexec --allow-run-as-root --bind-to socket -np 8 \
+         python resnet.py --num_iter=90 --iter_unit=epoch \
+         --data_dir=/data/imagenet/train-val-tfrecord-480/ \
+         --precision=fp16 --display_every=100 \
+         --export_dir=/tmp --dali_mode="GPU"
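+
+ Fit/Compile mode follows the standard Keras-with-Horovod recipe. A
+ simplified sketch of the pattern (with a stand-in dataset, not the
+ example's actual code)::
+
+     import horovod.tensorflow.keras as hvd
+     import tensorflow as tf
+
+     hvd.init()
+
+     # Stand-in dataset; the real scripts feed DALI output here.
+     dataset = tf.data.Dataset.from_tensor_slices(
+         (tf.random.uniform([8, 224, 224, 3]),
+          tf.random.uniform([8], maxval=1000, dtype=tf.int64))).batch(4)
+
+     model = tf.keras.applications.ResNet50(weights=None)
+     # Wrap the optimizer so gradients are averaged across all ranks.
+     opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.1 * hvd.size()))
+     model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")
+
+     model.fit(dataset, epochs=90,
+               # Start all workers from the same initial weights.
+               callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
+               verbose=1 if hvd.rank() == 0 else 0)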
+
+ For the benchmark training on 8 GPUs::
+
+     mpiexec --allow-run-as-root --bind-to socket -np 8 \
+         python resnet.py --num_iter=400 --iter_unit=batch \
+         --data_dir=/data/imagenet/train-val-tfrecord-480/ \
+         --precision=fp16 --display_every=100 --dali_mode="GPU"
+
+ Predicting in Keras Fit/Compile mode
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ For predicting with a previously saved model in ``/tmp``::
+
+     python resnet.py --predict --export_dir=/tmp --dali_mode="GPU"
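+
+ Prediction reloads the exported SavedModel. Roughly, assuming the model
+ was exported to ``/tmp`` as above (a sketch, not the script's exact code)::
+
+     import tensorflow as tf
+
+     # Load the SavedModel exported at the end of training.
+     model = tf.keras.models.load_model("/tmp")
+
+     # Run inference on a dummy batch of ImageNet-sized images.
+     images = tf.random.uniform([8, 224, 224, 3])
+     print(model.predict(images).shape)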
+
+ Training in CTL (Custom Training Loop) mode
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ For the full training on 8 GPUs::
+
+     mpiexec --allow-run-as-root --bind-to socket -np 8 \
+         python resnet_ctl.py --num_iter=90 --iter_unit=epoch \
+         --data_dir=/data/imagenet/train-val-tfrecord-480/ \
+         --precision=fp16 --display_every=100 \
+         --export_dir=/tmp --dali_mode="GPU"
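+
+ In CTL mode the training step is written out explicitly instead of going
+ through ``model.fit``. A minimal sketch of such a step with Horovod
+ (simplified, not the example's actual loop)::
+
+     import horovod.tensorflow as hvd
+     import tensorflow as tf
+
+     hvd.init()
+     model = tf.keras.applications.ResNet50(weights=None)
+     opt = tf.keras.optimizers.SGD(0.1 * hvd.size())
+     loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
+
+     @tf.function
+     def train_step(images, labels, first_batch):
+         with tf.GradientTape() as tape:
+             loss = loss_fn(labels, model(images, training=True))
+         # Average gradients across ranks before applying them.
+         tape = hvd.DistributedGradientTape(tape)
+         grads = tape.gradient(loss, model.trainable_variables)
+         opt.apply_gradients(zip(grads, model.trainable_variables))
+         if first_batch:
+             # Sync initial state from rank 0 after the first step.
+             hvd.broadcast_variables(model.variables, root_rank=0)
+             hvd.broadcast_variables(opt.variables(), root_rank=0)
+         return loss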
+
+ For the benchmark training on 8 GPUs::
+
+     mpiexec --allow-run-as-root --bind-to socket -np 8 \
+         python resnet_ctl.py --num_iter=400 --iter_unit=batch \
+         --data_dir=/data/imagenet/train-val-tfrecord-480/ \
+         --precision=fp16 --display_every=100 --dali_mode="GPU"
+
+ Predicting in CTL (Custom Training Loop) mode
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ For predicting with a previously saved model in ``/tmp``::
+
+     python resnet_ctl.py --predict --export_dir=/tmp --dali_mode="GPU"
+
+ Other useful options
+ ~~~~~~~~~~~~~~~~~~~~
+
+ To use TensorBoard (note that ``/tmp/some_dir`` must be created by the
+ user)::
+
+     --tensorboard_dir=/tmp/some_dir
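+
+ The resulting event files can then be viewed with the standard TensorBoard
+ CLI::
+
+     tensorboard --logdir=/tmp/some_dir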
+
+ To export a SavedModel at the end of training (note that ``/tmp/some_dir``
+ must be created by the user)::
+
+     --export_dir=/tmp/some_dir
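+
+ The exported artifact is a standard TensorFlow SavedModel; saving one
+ manually looks roughly like this (a sketch; the scripts do this for you
+ when the flag is given)::
+
+     import tensorflow as tf
+
+     model = tf.keras.applications.ResNet50(weights=None)
+     # Writes the graph, weights, and signatures under /tmp/some_dir.
+     model.save("/tmp/some_dir")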
+
+ To store checkpoints at the end of every epoch (note that ``/tmp/some_dir``
+ must be created by the user)::
+
+     --log_dir=/tmp/some_dir
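+
+ Per-epoch checkpoints of this kind are typically written with
+ ``tf.train.Checkpoint``; a sketch of saving and restoring (the paths are
+ illustrative)::
+
+     import tensorflow as tf
+
+     model = tf.keras.applications.ResNet50(weights=None)
+     opt = tf.keras.optimizers.SGD(0.1)
+     ckpt = tf.train.Checkpoint(model=model, optimizer=opt)
+
+     # Save once per epoch; numbered files land in /tmp/some_dir.
+     ckpt.save("/tmp/some_dir/ckpt")
+
+     # Later: restore the most recent checkpoint, if any.
+     ckpt.restore(tf.train.latest_checkpoint("/tmp/some_dir"))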
+
+ To enable XLA::
+
+     --use_xla
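+
+ In TensorFlow 2 this typically maps to turning on the JIT compiler
+ globally, e.g. (a sketch, not necessarily the script's exact mechanism)::
+
+     import tensorflow as tf
+
+     # Compile eligible TensorFlow graphs with XLA.
+     tf.config.optimizer.set_jit(True)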

Requirements
~~~~~~~~~~~~
@@ -41,7 +97,7 @@ TensorFlow

::

-     pip install tensorflow-gpu==1.10.0
+     pip install tensorflow-gpu==2.3.1

OpenMPI
^^^^^^^