- Title: Convolutional Networks with Adaptive Inference Graphs
- Authors: Andreas Veit, Serge Belongie
- Link: http://openaccess.thecvf.com/content_ECCV_2018/papers/Andreas_Veit_Convolutional_Networks_with_ECCV_2018_paper.pdf
- Tags: Neural Network, Gating, Gumbel, ECCV 2018
- Year: 2018
-
What
- They introduce a dynamic gating mechanism for ResNet that decides on-the-fly which layers to execute for a given input image.
- This improves computational efficiency as well as classification accuracy and robustness to adversarial attacks.
- Note that the decision of whether to execute a layer during training is stochastic, making it closely related to stochastic depth.
-
How
- Their technique is based on residual connections, i.e. on ResNet.
- They add a gating function, so that `x_l = x_{l-1} + F(x_{l-1})` becomes `x_l = x_{l-1} + gate(x_{l-1}) * F(x_{l-1})`.
- Gating
  - The gating function produces a discrete output (`0` or `1`), enabling the network to completely skip a layer.
  - They implement the gating function by applying the following steps to an input feature map (sketched in code after this gating block):
    - Global average pooling, leading to a `Cx1x1` output.
    - Flatten to a vector.
    - Apply a fully connected layer with `d` outputs and ReLU.
    - Apply a fully connected layer with `2` outputs.
    - Apply straight-through Gumbel-softmax to the output.
  - Straight-through Gumbel-softmax is a standard technique that is comparable to softmax, but produces discrete `0`/`1` outputs during the forward pass and uses a softmax during the backward pass to approximate gradients.
  - For small `d`, the additional execution time of the gating function is negligible.
  - Visualization:
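A minimal PyTorch sketch of this gating mechanism as I understand it from the description above (module and helper names such as `Gate`, `GatedResidualBlock` and `st_gumbel_softmax` are my own, not the authors'):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def st_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-softmax: discrete one-hot forward pass, softmax gradients backward.

    PyTorch also ships this behaviour as F.gumbel_softmax(logits, tau, hard=True).
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
    soft = F.softmax((logits + gumbel) / tau, dim=-1)                         # relaxed sample
    hard = torch.zeros_like(soft).scatter_(-1, soft.argmax(dim=-1, keepdim=True), 1.0)
    return hard - soft.detach() + soft  # forward: hard 0/1, backward: gradient of soft


class Gate(nn.Module):
    """Decides per input whether a residual layer should be executed."""

    def __init__(self, channels: int, d: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, d)
        self.fc2 = nn.Linear(d, 2)  # two logits: [skip, execute]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling -> (N, C)
        z = F.relu(self.fc1(z))                     # FC with d outputs + ReLU
        logits = self.fc2(z)                        # FC with 2 outputs
        y = st_gumbel_softmax(logits)               # discrete decision, still differentiable
        return y[:, 1].view(-1, 1, 1, 1)            # 1 = execute the layer, 0 = skip it


class GatedResidualBlock(nn.Module):
    """x_l = x_{l-1} + gate(x_{l-1}) * F(x_{l-1})."""

    def __init__(self, residual_fn: nn.Module, channels: int, d: int = 16):
        super().__init__()
        self.residual_fn = residual_fn  # the usual conv/BN/ReLU stack of a ResNet block
        self.gate = Gate(channels, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(x)
        # To actually save FLOPs at inference time, residual_fn would only be evaluated
        # for samples whose gate fires; here it is always computed for clarity.
        return x + g * self.residual_fn(x)
```

During training the gate's decision is stochastic due to the Gumbel noise, which is what makes the approach closely related to stochastic depth.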
- They also introduce a target rate loss, which pushes the network to execute `t` percent of all layers (i.e. let `t` percent of all gates produce `1` as the output); a sketch of such a loss follows below.
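A minimal sketch of how such a target-rate loss could look, assuming it penalizes the squared deviation between each layer's batch-wise execution rate and the target rate `t` (the paper's exact formulation may differ in detail):

```python
from typing import List

import torch


def target_rate_loss(gate_decisions: List[torch.Tensor], t: float = 0.7) -> torch.Tensor:
    """Push each gated layer's batch-wise execution rate towards the target rate t.

    gate_decisions holds, per gated layer, a tensor of shape (N,) with the 0/1 gate
    outputs for the current mini-batch (straight-through, so still differentiable).
    """
    rates = torch.stack([z.float().mean() for z in gate_decisions])  # per-layer execution rates
    return ((rates - t) ** 2).mean()  # squared deviation from t, averaged over layers
```

This term would be added to the classification loss; `t` then controls the trade-off between computation and accuracy.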
-
Results
- CIFAR-10
  - They choose `d`=16, which adds 0.01% FLOPs and 4.8% parameters to ResNet-110.
  - At `t`=0.7 they observe that on average 82% of all layers are executed.
  - The model chooses to always execute downsampling layers. They seem to be important.
  - When always executing all layers and scaling their outputs by their likelihood of being executed (similar to Dropout and Stochastic Depth), they achieve a higher score than a network trained with stochastic depth. This indicates that their results do not just come from a regularizing effect. (A sketch of this evaluation mode follows below.)
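A sketch of the "scale outputs by execution likelihood" evaluation mode mentioned above, assuming the gate's softmax probability is taken as that likelihood (illustrative only, hypothetical helper name):

```python
import torch
import torch.nn.functional as F


def expected_execution_forward(x: torch.Tensor, residual_fn, gate_logits: torch.Tensor) -> torch.Tensor:
    """Always execute the layer, but scale its output by the probability
    that the gate would have chosen to execute it."""
    p_execute = F.softmax(gate_logits, dim=-1)[:, 1].view(-1, 1, 1, 1)  # P(gate = 1)
    return x + p_execute * residual_fn(x)
```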
- ImageNet
  - They compare gated versions of ResNet-50 and ResNet-101.
  - In ResNet-101 they always execute the first couple of layers (ungated), while in ResNet-50 they gate all layers.
  - Trade-off between FLOPs and accuracy (ResNet-50/101 vs. their model "ConvNet-AIG" at different `t`):
  - Visualization of layerwise execution rate:
  - Execution distribution and execution frequency during training:
  - The model tends to execute more layers for non-iconic views of objects (e.g. a dog's face seen from an unusual perspective instead of the whole dog seen from the side).
  - When running adversarial attacks on ResNet-50 and their ConvNet-AIG-50 (via the Fast Gradient Sign Method, sketched at the end of this note) they observe that:
    - Their model is overall more robust to adversarial attacks, i.e. it keeps a higher accuracy (both with and without protection via JPEG compression).
    - The set of executed layers remains mostly the same in their model, i.e. the gates themselves do not seem to be affected by the attack.
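For reference, a generic sketch of the Fast Gradient Sign Method used in these experiments (epsilon and the pixel range below are illustrative choices, not the paper's settings):

```python
import torch
import torch.nn.functional as F


def fgsm_attack(model, images: torch.Tensor, labels: torch.Tensor, epsilon: float = 0.03) -> torch.Tensor:
    """Fast Gradient Sign Method: one gradient step on the input, in the direction
    that increases the classification loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()  # perturb along the sign of the input gradient
    return adv.clamp(0.0, 1.0).detach()          # keep pixels in a valid range
```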