Commit 1b0053d

Update ResNet50 example to work with TensorFlow 2.x (NVIDIA#2537)

* Update ResNet50 example to work with TF 2.x

Signed-off-by: Janusz Lisiecki <[email protected]>

1 parent de19da0 commit 1b0053d

17 files changed, +1795 -1000 lines changed

docs/examples/use_cases/tensorflow/resnet-n/README.rst

Lines changed: 85 additions & 29 deletions
@@ -4,34 +4,90 @@ ResNet-N with TensorFlow and DALI
 This demo implements residual networks model and use DALI for the data
 augmentation pipeline from `the original paper`_.
 
-Common utilities for defining the network and performing basic training
-are located in the nvutils directory. Use of nvutils is demonstrated in
-the model scripts available in :fileref:`docs/examples/use_cases/tensorflow/resnet-n/resnet.py`.
-
-For parallelization, we use the Horovod distribution framework, which
-works in concert with MPI. To train ResNet-50 (``--layers=50``) using 8
-V100 GPUs, for example on DGX-1, use the following command
-(``--dali_cpu`` indicates to the script to use CPU backend for DALI):
-
-::
-
-    $ mpiexec --allow-run-as-root --bind-to socket -np 8 python resnet.py \
-        --layers=50 \
-        --data_dir=/data/imagenet \
-        --data_idx_dir=/data/imagenet-idx \
-        --precision=fp16 \
-        --log_dir=/output/resnet50 \
-        --dali_cpu
-
-Here we have assumed that imagenet is stored in tfrecord format in the
-directory '/data/imagenet'. After training completes, evaluation is
-performed using the validation dataset.
-
-Some common training parameters can tweaked from the command line.
-Others must be configured within the network scripts themselves.
-
-Original scripts modified from ``nvidia-examples`` scripts in `NGC
-TensorFlow Container`_
+It implements the ResNet50 v1.5 CNN model and demonstrates efficient
+single-node training on multi-GPU systems. They can be used for benchmarking, or
+as a starting point for implementing and training your own network.
+
+Common utilities for defining CNN networks and performing basic training are
+located in the nvutils directory. The utilities are written in Tensorflow 2.0.
+Use of nvutils is demonstrated in the model script (i.e. resnet.py). The scripts
+support both Keras Fit/Compile and Custom Training Loop (CTL) modes with
+Horovod.
+
+To use DALI pipeline for data loading and preprocessing
+```
+--dali_mode=GPU #or
+--dali_mode=CPU
+```
+
+## Training in Keras Fit/Compile mode
+For the full training on 8 GPUs:
+```
+mpiexec --allow-run-as-root --bind-to socket -np 8 \
+python resnet.py --num_iter=90 --iter_unit=epoch \
+--data_dir=/data/imagenet/train-val-tfrecord-480/ \
+--precision=fp16 --display_every=100 \
+--export_dir=/tmp --dali_mode="GPU"
+```
+
+For the benchmark training on 8 GPUs:
+```
+mpiexec --allow-run-as-root --bind-to socket -np 8 \
+python resnet.py --num_iter=400 --iter_unit=batch \
+--data_dir=/data/imagenet/train-val-tfrecord-480/ \
+--precision=fp16 --display_every=100 --dali_mode="GPU"
+```
+
+## Predicting in Keras Fit/Compile mode
+For predicting with previously saved mode in `/tmp`:
+```
+python resnet.py --predict --export_dir=/tmp --dali_mode="GPU"
+```
+
+## Training in CTL (Custom Training Loop) mode
+For the full training on 8 GPUs:
+```
+mpiexec --allow-run-as-root --bind-to socket -np 8 \
+python resnet_ctl.py --num_iter=90 --iter_unit=epoch \
+--data_dir=/data/imagenet/train-val-tfrecord-480/ \
+--precision=fp16 --display_every=100 \
+--export_dir=/tmp --dali_mode="GPU"
+```
+
+For the benchmark training on 8 GPUs:
+```
+mpiexec --allow-run-as-root --bind-to socket -np 8 \
+python resnet_ctl.py --num_iter=400 --iter_unit=batch \
+--data_dir=/data/imagenet/train-val-tfrecord-480/ \
+--precision=fp16 --display_every=100 --dali_mode="GPU"
+```
+
+## Predicting in CTL (Custom Training Loop) mode
+For predicting with previously saved mode in `/tmp`:
+```
+python resnet_ctl.py --predict --export_dir=/tmp --dali_mode="GPU"
+```
+
+## Other useful options
+To use tensorboard (Note, `/tmp/some_dir` needs to be created by users):
+```
+--tensorboard_dir=/tmp/some_dir
+```
+
+To export saved model at the end of training (Note, `/tmp/some_dir` needs to be created by users):
+```
+--export_dir=/tmp/some_dir
+```
+
+To store checkpoints at the end of every epoch (Note, `/tmp/some_dir` needs to be created by users):
+```
+--log_dir=/tmp/some_dir
+```
+
+To enable XLA
+```
+--use_xla
+```
 
 Requirements
 ~~~~~~~~~~~~
@@ -41,7 +97,7 @@ TensorFlow
 
 ::
 
-    pip install tensorflow-gpu==1.10.0
+    pip install tensorflow-gpu==2.3.1
 
 OpenMPI
 ^^^^^^^
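
The training commands in the updated README all share one shape: an `mpiexec` launcher, a script (`resnet.py` or `resnet_ctl.py`), and a set of flags. As a rough illustration only, the sketch below assembles such a command in Python and creates the output directory first, since the README notes that those directories need to be created by users. The helper names `prepare_output_dirs` and `build_train_command` are hypothetical and not part of the example scripts; the flags and default values are copied from the README text.

```python
import os
import shlex


def prepare_output_dirs(*dirs):
    """Create each output directory if it does not exist yet.

    The README says directories passed to --export_dir, --log_dir and
    --tensorboard_dir must be created by the user beforehand.
    """
    for d in dirs:
        os.makedirs(d, exist_ok=True)


def build_train_command(script="resnet.py", num_gpus=8, num_iter=90,
                        iter_unit="epoch",
                        data_dir="/data/imagenet/train-val-tfrecord-480/",
                        export_dir=None, dali_mode="GPU"):
    """Return the launch command as an argument list (hypothetical helper)."""
    cmd = [
        "mpiexec", "--allow-run-as-root", "--bind-to", "socket",
        "-np", str(num_gpus),
        "python", script,
        f"--num_iter={num_iter}",
        f"--iter_unit={iter_unit}",
        f"--data_dir={data_dir}",
        "--precision=fp16",
        "--display_every=100",
        f"--dali_mode={dali_mode}",
    ]
    # --export_dir is optional: the benchmark commands in the README omit it.
    if export_dir is not None:
        cmd.append(f"--export_dir={export_dir}")
    return cmd


if __name__ == "__main__":
    # Benchmark run in CTL mode, mirroring the README's 400-batch example.
    print(shlex.join(build_train_command(
        script="resnet_ctl.py", num_iter=400, iter_unit="batch")))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only to print a copy-pasteable line.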

docs/examples/use_cases/tensorflow/resnet-n/nvutils/__init__.py

Lines changed: 13 additions & 15 deletions
@@ -14,25 +14,23 @@
 # limitations under the License.
 # ==============================================================================
 
-from .optimizers import LarcOptimizer
-from .optimizers import LossScalingOptimizer
-from .builder import LayerBuilder
-from .var_storage import fp32_trainable_vars
-from .image_processing import image_set
 from .runner import train
-from .runner import validate
-from .cmdline import RequireInCmdline
+from .runner_ctl import train_ctl
+from .runner import predict
+from .runner_ctl import predict_ctl
 from .cmdline import parse_cmdline
 import os, sys, random
 import tensorflow as tf
-import horovod.tensorflow as hvd
+import horovod.tensorflow.keras as hvd
 
 def init():
-    gpu_thread_count = 2
-    os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
-    os.environ['TF_GPU_THREAD_COUNT'] = str(gpu_thread_count)
-    os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
-    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
+  gpu_thread_count = 2
+  os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
+  os.environ['TF_GPU_THREAD_COUNT'] = str(gpu_thread_count)
+  os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
+  os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
+  hvd.init()
+  if hvd.rank() == 0:
     print('PY', sys.version)
-    print('TF', tf.__version__)
-    hvd.init()
+    print('TF', tf.version.VERSION)
+
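
The reordering in `init()` matters because `hvd.rank()` is only meaningful once `hvd.init()` has run, and gating the prints on rank 0 keeps an 8-process job from printing the version banner eight times. The sketch below reproduces that control flow with a hypothetical `FakeHorovod` stand-in so it runs without Horovod or TensorFlow installed. The `hvd`, `tf_version` and `out` parameters are assumptions added for illustration and do not appear in the commit.

```python
import os
import sys


class FakeHorovod:
    """Hypothetical stand-in for horovod.tensorflow.keras (illustration only)."""

    def __init__(self, rank):
        self._rank = rank
        self._initialized = False

    def init(self):
        self._initialized = True

    def rank(self):
        # Mimics the requirement that init() is called before rank().
        if not self._initialized:
            raise RuntimeError("call init() before rank()")
        return self._rank


def init(hvd, tf_version="unknown", out=sys.stdout):
    """Set the environment variables, then print versions on rank 0 only."""
    gpu_thread_count = 2
    os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
    os.environ['TF_GPU_THREAD_COUNT'] = str(gpu_thread_count)
    os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
    # Initialize first, so that rank() can be queried below.
    hvd.init()
    if hvd.rank() == 0:
        print('PY', sys.version, file=out)
        print('TF', tf_version, file=out)


if __name__ == "__main__":
    init(FakeHorovod(rank=0), tf_version="2.3.1")  # prints the banner
    init(FakeHorovod(rank=1), tf_version="2.3.1")  # prints nothing
```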

docs/examples/use_cases/tensorflow/resnet-n/nvutils/builder.py

Lines changed: 0 additions & 91 deletions
This file was deleted.
