2017 IIIT Summer School - Computer Vision

20170703

C. V. Jawahar - Intro & Background (09:30 to 11:00)

  • Obj: to understand and learn the recent-ish advances in the CV scene in the world

  • Last year summer school: basics of deep learning, touched CV; this year, 2 summer schools: one for CV; the other for Deep Learning (with some overlap)

  • Typical day plan: Breakfast, Session 1, Break, Session 2, Lunch, Session 3a, Session 3b, Break, Demo/Lab/Tutorial, Lab/Practice & Quiz, Dinner

  • CV Goal - To make computers understand images and videos

  • Scene classification (outdoor, lakeside...), Object Classification (is that a car, or...), Object Detection (where is the car, ...), Semantic Segmentation (in what all pixels is the car, ...), Pose Estimation (which direction is the car facing, ...)

  • Fei-Fei, Koch and Perona, What do we perceive in a glance of a real-world scene?, Journal of Vision 2007

  • Image -> Text: Image Annotation, Image Caption Generation, Image Description

  • History of CV - face detection, ...

  • Recognition: Classification (Instance recognition, Category recognition, Fine-grain classification), ... , Labels

  • Challenges : Occlusions, truncations, scale/size, articulation, Inter-class similarity, Intra-class variation

  • Variations in problems: Binary Classification, Multi-class, Multi-Label, Multi-output

  • Feature extraction: I -> X, Classification: X -> Y; End-to-end: can we do I -> Y?

  • Caltech 101 (2003): dataset for basic-level classification; objects from 101 classes; considered a toy dataset now; possibly gained high accuracies quickly because images were captured for the purpose of classification

  • PASCAL VOC (2005-2012): 20 object classes, 22,591 images; multiple tasks: Classification, Detection, Segmentation; van Gool, Zisserman, IJCV 2015

  • ImageNet (ILSVRC) (2010): 1000 object classes, 14,197,122 images; Classification as Top-5, Karpathy, Fei-Fei, IJCV 2015

  • COCO: harder than ImageNet; 80 object classes, 300,000 images; Describe Images, Human Keypoints

  • More datasets...

  • Evaluating predicted bounding boxes: how much does the predicted box overlap the true one? Just the intersection area? Intersection over Union (IoU)? (see the IoU sketch below)

  • Basic Detection: Image -> features of every possible rectangle -> rectangle with max probability of class; Update: region proposal

  • Evaluation metric: Average Precision (AP), the precision averaged over recall levels (area under the precision-recall curve); mean AP (mAP) is AP averaged over all classes

  • Success of classification/detection: ML, data, computation (GPUs)
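
A minimal sketch of the IoU measure referred to above, in plain Python; boxes are assumed to be (x1, y1, x2, y2) corner coordinates (a hypothetical helper, not from the lecture):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a detection is usually counted as correct if IoU >= 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```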

HISTORY OF VISION: low-level features

- Extremely low-level vision: filtering
- Edges: Canny (1986), Sobel, Prewitt, ...
- Textures: Viola-Jones (2001), ...
- Histograms: [SIFT (Lowe, 1999)](https://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/lowe_ijcv2004.pdf), Shape contexts (Malik, 2001), [Spatial Pyramid Matching (Lazebnik, Schmid, Ponce, 2006)](http://www.vision.caltech.edu/Image_Datasets/Caltech101/cvpr06b_lana.pdf), [DPM based on HOG (Felzenszwalb et al., 2010)](https://cs.brown.edu/~pff/papers/lsvm-pami.pdf)
- Bag of Words: histograms happened in Text domain, so we brought them to images, like histograms of textures
- Bag of Visual Words: histogram of predefined visual textures (Visual Words)
- [Bag of Words (Zisserman, 2003)](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), SIFT (Lowe, 1999, [2004](https://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/lowe_ijcv2004.pdf)), [HOG+SVM (Dalal and Triggs, 2005)](http://www.csd.uwo.ca/~olga/Courses/Fall2009/9840/Papers/DalalTriggsCVPR05.pdf)
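
A rough sketch of the Bag of Visual Words pipeline described in the bullets above, assuming Python with OpenCV and scikit-learn (SIFT availability depends on the OpenCV build, so this is illustrative rather than a drop-in recipe):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(image_paths, vocab_size=100):
    """Build a visual vocabulary with k-means and represent each image as a histogram of visual words."""
    sift = cv2.SIFT_create()  # assumes an OpenCV build that ships SIFT
    per_image_descs = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image_descs.append(desc if desc is not None else np.zeros((0, 128), np.float32))

    # Visual vocabulary: cluster all descriptors into `vocab_size` visual words
    all_descs = np.vstack([d for d in per_image_descs if len(d)])
    vocab = KMeans(n_clusters=vocab_size, n_init=4).fit(all_descs)

    # Each image -> histogram of visual-word counts (L1-normalised)
    hists = []
    for desc in per_image_descs:
        words = vocab.predict(desc) if len(desc) else np.array([], dtype=int)
        hist = np.bincount(words, minlength=vocab_size).astype(np.float32)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.stack(hists)  # feed these histograms into an SVM, as in the pipelines cited above
```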

HISTORY OF VISION: Mid-level features

- Semantic segmentation

HISTORY OF VISION: High-level features

- Deep Learning: you can learn low-level to mid-level to high-level features automatically without manual intervention
- One-Hot to Rich Representations: [word2vec](https://arxiv.org/abs/1301.3781) in text (Mikolov, 2013)
- CBOW: Given a sequence of words, can you predict the missing middle word? Skip-gram: Given one word, can you predict the sequence of words before and after?

SEGMENTATION

- As Clustering: group similar pixels together (unsupervised), distance based on color and position; not really acceptable anymore for any decent segmentation
- K-Means, [Normalized Graph Cut](https://people.eecs.berkeley.edu/~malik/papers/SM-ncut.pdf) (Shi and Malik, 2000)
- Graph Cut: Label Pixels as Background (Source) or Object (Sink), make that as a graph, cut the graph so that there is no path between Source and Sink; use MRF, etc. for making and cutting the graph
- Graph Cut by Energy Minimization: pairwise constraint on pixel values
- [Grab Cut (Rother et al., 2004)](https://cvg.ethz.ch/teaching/cvl/2012/grabcut-siggraph04.pdf) using Iterated Graph Cuts: user initialization, then iterate: learn foreground, learn background; the user initialization provides the supervision (see the OpenCV sketch after this list)
- Superpixels: group pixels together, now apply same techniques by assigning labels to superpixels instead of pixels
- Semantic Segmentation: Class Segmentation (where are persons?), Instance Segmentation (class: persons, segment boundaries of each person), Segmentation from expression ("wearing blue")
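
A minimal OpenCV sketch of the GrabCut flow above, assuming the user supplies a rough bounding rectangle around the object (file name and rectangle are made up):

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg")                     # any BGR image
mask = np.zeros(img.shape[:2], np.uint8)          # per-pixel GrabCut labels
bgd_model = np.zeros((1, 65), np.float64)         # internal GMM state (background)
fgd_model = np.zeros((1, 65), np.float64)         # internal GMM state (foreground)

rect = (50, 50, 300, 400)                         # user-supplied box: x, y, w, h
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as (probable) foreground form the segmentation
fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
segmented = img * fg[:, :, None]
```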

C. V. Jawahar - Intro to DL for CV (11:30 to 12:45)

  • Linear Classifiers

  • Nearest neighbours

  • Cholesky Decomposition

  • Nearest neighbours for image annotation: "A new baseline for image annotation"

  • SVM

    • Max-margin classifier
    • Key words: Training, testing, generalization, error, complexity (number of parameters), generative classifiers, discriminative classifiers
- After such classifiers came AlexNet which suddenly improved error from ~25% to ~15%
- Appreciate what Alex did, it was very difficult to do the first time
  • Over the years, deeper networks brought the ImageNet error rate down to human level

  • LeNet (1989) - LeNet (1998) - AlexNet (2012) comparison

Neural Networks

  • MLP

  • Back-propagate through the networks using gradient descent

  • CNN: Locally connected networks with shared weights

    • FC (too many weights/parameters) -> Locally connected filters (still many weights, one set per, say, 3x3 patch of the original image) -(BIG JUMP)> share the weights and convolve over the image (far fewer parameters) => Convolutional layer with 1 feature map -> Convolutional layer with multiple feature maps
    • It is observed that the filters learned this way are similar to the filters we used to hand-design many years ago
    • Pool: shrink the output size by aggregating (e.g. max or average) over local neighbourhoods
    • Stride: skip positions while sliding the filter, reducing the output size and computation
  • Activation functions: pass each layer's output through a non-linearity so the stack can model non-linear functions

  • Stack such layers together - ...-ConvPoolNorm-ConvPoolNorm-...; this behaves much like an MLP (a minimal sketch appears a few bullets below)

  • This is what Alex did: "ImageNet Classification with Deep Convolutional Neural Networks"

  • CNN features are generic: Now, we can use the same network, remove the last classification layer, and use the features learnt till the penultimate layer to classify other object categories!

    • Train the CNN on a very large dataset like ImageNet
    • Reuse the CNN to solve smaller problems by removing the last (classification) layer
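
A minimal PyTorch sketch of the Conv-Pool(-Norm) stacking mentioned a few bullets above; layer sizes are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn as nn

# Two Conv-Pool-Norm blocks followed by an MLP-style classifier head
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # shared 3x3 filters, 16 feature maps
    nn.ReLU(),                                    # non-linearity
    nn.MaxPool2d(2),                              # pooling shrinks the spatial size
    nn.BatchNorm2d(16),                           # a modern stand-in for the "Norm" step
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # stride also reduces size
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.BatchNorm2d(32),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # assumes 64x64 inputs and 10 classes
)

x = torch.randn(4, 3, 64, 64)                     # a dummy batch
print(model(x).shape)                             # torch.Size([4, 10])
```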

Fine tuning

- Extend to more classes (eg. from 1000 classes to another new 100 classes)
- Extend to new tasks (eg. from object classification to scene classification) (Transfer Learning)
- Extend to new datasets (eg. from ImageNet to PASCAL)
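
A hedged PyTorch/torchvision sketch of this kind of reuse, using the older `pretrained=True` flag: take an ImageNet-pretrained backbone, swap the final classification layer for a new 100-class head, and optionally freeze the rest (names and sizes are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Pretrained on ImageNet (1000 classes)
net = models.resnet18(pretrained=True)

# Option 1: feature extractor - freeze everything except the new head
for p in net.parameters():
    p.requires_grad = False

# Replace the last (classification) layer with a new 100-class head
net.fc = nn.Linear(net.fc.in_features, 100)

# Option 2 (fine-tuning): leave requires_grad=True and train the whole
# network with a small learning rate instead of freezing.
```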

Transfer Learning

- People tried to see why same features could be used for different tasks
- [“How transferable are features in deep neural networks?”, Bengio, NIPS 2014](https://arxiv.org/abs/1411.1792)
  • Other popular deep architectures: Autoencoder, RBM, RNN, …

  • Summary

Girish Verma - AlexNet and Beyond (13:30 to 15:30)

![alt text](https://kratzert.github.io/images/finetune_alexnet/alexnet.png "AlexNet")

![alt text](http://everglory99.github.io/Intro_DL_TCC/intro_dl_images/dropout1.png "Dropout")

- In each iteration, randomly choose some weights to zero out their outputs
- Train the network
- At testing time, use all neurons, don’t zero
- So no single neuron can rely too heavily on particular other neurons; such co-dependencies are eliminated
- It naturally takes more epochs to reach the same training accuracy as without Dropout, but the test accuracy improves considerably
  • Dropout was used in AlexNet
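
A small PyTorch sketch of the Dropout behaviour described above; `model.train()` enables the random zeroing, `model.eval()` disables it at test time (sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(128, 10),
)

x = torch.randn(8, 256)

model.train()              # training mode: dropout active, outputs rescaled by 1/(1-p)
y_train = model(x)

model.eval()               # test mode: all neurons used, no zeroing
y_test = model(x)
```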

ZFNet (ImageNet 2013 Winner)

(figure: ZFNet architecture)

- Looks very similar to AlexNet, but they tried to interpret the feature activity in the intermediate layers
- They visualized the outputs of each layer
- ZFNet was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers
- Also, they did Input Jittering
- #### Input Jittering
    - Scale down the largest dimension to 256
    - Take 5 sub-images of size 224x224 (the four corners and the centre) from this 256-sized image
    - Flip each of them horizontally to make a total of 10 images
    - Use all of these to train the network (see the torchvision sketch after this list)
- Winner of ImageNet localization task 2013
- Training to classify, locate and detect objects improves accuracy of all three
- No need to jitter, takes care of cropping and scaling within the network itself
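
The same 10-crop jittering can be expressed with torchvision transforms; a hedged sketch (TenCrop gives the 4 corner crops plus the centre crop and their horizontal flips; Resize(256) scales the shorter side, a common convention; the file name is made up):

```python
import torch
from PIL import Image
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),                 # scale the image so that 224x224 crops fit
    transforms.TenCrop(224),                # 4 corner crops + centre crop, plus their horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

img = Image.open("example.jpg").convert("RGB")
batch = ten_crop(img)                       # tensor of shape [10, 3, 224, 224]
print(batch.shape)
```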

GoogLeNet/Inception (runner up in ImageNet 2014)

![alt text](http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/img/googlenet-arch_1228x573.jpg "GoogLeNet/Inception")

- They increased the depth to 22 layers
- They saw that the FC layers contain the most number of parameters. So they reduced the number of parameters in the FC layer to the bare minimum, instead compensating via the convolutional layers
- They carefully designed convolutional layers called Inception modules
- #### Inception Modules

![alt text](https://cpmajgaard.com/blog/assets/images/parking/inception.jpg "Inception module")

    - Each Inception module runs 1x1, 3x3 and 5x5 convolutions plus 3x3 max pooling in parallel and concatenates the resulting feature maps
- Stack Inception modules together
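
A simplified PyTorch sketch of a naive Inception module in the spirit of the figure above; real GoogLeNet modules also insert 1x1 convolutions to reduce channels before the 3x3/5x5 branches, which is omitted here:

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Run 1x1, 3x3, 5x5 convolutions and 3x3 max pooling in parallel, then concatenate."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(NaiveInception(64, 32, 64, 16)(x).shape)   # 32 + 64 + 16 + 64 = 176 channels
```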

VGGNet (winner of ImageNet 2014)

- Deeper-is-better philosophy: stacked 3x3 filter layers cover the same receptive field as 5x5 or 7x7 or bigger filters, with fewer parameters and more non-linearities
- Uses 19 layers

ResNet (winner of ImageNet 2015)

![alt text](https://qph.ec.quoracdn.net/main-qimg-cf89aa517e5b641dc8e41e7a57bafc2c "ResNet")

- The network which won the competition was 152 layers deep
- Very deep networks have a vanishing gradient problem. ResNet overcomes this using skip connections
- Skip connections: add the input of a block unchanged to its output (an identity shortcut), so during backprop gradients from much later layers flow straight back alongside those from the regular path (see the sketch below)
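
A small PyTorch sketch of a basic residual block with an identity shortcut (channel counts are illustrative; the projection shortcut used when shapes change is omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv layers; the '+ x' is the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut: gradients also flow through 'x' untouched

x = torch.randn(2, 32, 16, 16)
print(ResidualBlock(32)(x).shape)   # same shape as the input
```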

Ensembles

- Train multiple networks and take a majority vote
  • Batch Normalization (from CS231n lecture slides)

CLASSIFICATION + LOCALIZATION (single object)

  • Classification: get the object label, Localization: get the bounding box of the object
    • Use a classification network to classify the object
    • Use another network to get the bounding box
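
A minimal PyTorch sketch of the two-head idea above: one shared backbone, a classification head and a separate box-regression head (all names and sizes are hypothetical):

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a pretrained CNN trunk
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4), nn.Flatten()
        )
        self.cls_head = nn.Linear(16 * 4 * 4, num_classes)  # "what is it?"
        self.box_head = nn.Linear(16 * 4 * 4, 4)            # "where is it?" (x, y, w, h)

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.box_head(feat)

net = ClassifyAndLocalize()
logits, box = net(torch.randn(1, 3, 224, 224))
# Train with: total loss = cross-entropy on logits + L2 (or smooth L1) loss on the box
```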

DETECTION

  • Task: put a bounding box on every instance of any class

![alt text](https://qph.ec.quoracdn.net/main-qimg-c96241e4e90c2b8509c4b1e87965965a "R-CNN")

- Image -> Extract Region Proposals -> Compute CNN features for each of the proposed region -> Classify regions
- But running the CNN separately on each of the many region proposals is expensive
- Operation can be optimized by passing full image through CNN, and then looking at the proposed region’s outputs from CNN

Semantic Segmentation

- We can make the output have the same spatial dimensions as the input image, with the channel dimension holding the per-class label scores
- We need to unpool/upsample to get back to the input image size
- In [DeconvNet](http://arxiv.org/abs/1505.04366) we convolve and then deconvolve so that the output has the same size as the input image

![alt text](http://cvlab.postech.ac.kr/research/deconvnet/images/overall.png "DeconvNet") - Good link!
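
A toy PyTorch sketch of the convolve-then-deconvolve idea: downsample with strided convolutions, upsample back to full resolution with transposed convolutions, and emit one score map per class (sizes are arbitrary):

```python
import torch
import torch.nn as nn

num_classes = 21
seg_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),            # H/2 x W/2
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),           # H/4 x W/4
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # back to H/2 x W/2
    nn.ConvTranspose2d(16, num_classes, 4, stride=2, padding=1),    # H x W, one map per class
)

x = torch.randn(1, 3, 128, 128)
out = seg_net(x)
print(out.shape)                       # [1, 21, 128, 128]
print(out.argmax(dim=1).shape)         # per-pixel class labels, [1, 128, 128]
```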

Instance Segmentation

- Each instance of a class is labeled separately

Sequence-to-Sequence problems

Recurrent Neural Network

- Convert x_1, x_2, … , x_s to o_1, o_2, … , o_t
- Plain RNNs struggle to remember long-term dependencies; gated variants such as LSTMs are designed to capture them (see the sketch below)
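
A tiny PyTorch sketch of running an LSTM over a sequence of feature vectors, just to fix the shapes involved (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

x = torch.randn(4, 10, 64)        # batch of 4 sequences, 10 time steps, 64-dim inputs
outputs, (h_n, c_n) = rnn(x)      # outputs: one 128-dim vector per time step
print(outputs.shape, h_n.shape)   # [4, 10, 128], [1, 4, 128]
```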

Lab Session - Praveen

  • Use PyTorch
  • CNNs, ResNet

20170704

Chetan Arora (IIIT Delhi) - Detection (09:30 to 12:20)

  • Detection: Object Location, Object Attributes

  • Applications: Instance Recognition, Assistive Vision, Security/Surveillance, Activity Recognition (in videos)

  • Challenges: Illumination, occlusions, background clutter, intra-class variation

Object Recognition Ideas: Historical

- Geometrical era
    - Fit model to a transformation between pairs of features
    - Machine Perception of Three Dimensional Solids, a PhD Thesis in 1963
    - It’s invariant to camera position, illumination, internal parameters
    - Invariant to similarity Tx of four points
    - But, intra-class variation is a big problem
- Appearance Models
    - Eigenfaces ([Turk and Pentland, 1991](http://www.face-rec.org/algorithms/PCA/jcn.pdf))
    - Other appearance manifolds
    - Requires global registration of patterns
    - Not even robust to translation, forget about occlusions, clutter, geometric Tx
- Sliding Window
    - Like, Haar wavelet by Viola Jones on a sliding window ([Viola and Jones, 2001](http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf))
    - THE method until deep learning came, because it’s pretty fast
    - But, we need non-maximal suppression
- Local Features
    - SIFT (Lowe, 1999, 2004), SURF
    - Bag of Features: extract features -> bag them into words -> match stuff through words, not features
    - Spatial orientation is ignored
- Parts-Based Models
    - Model: object as a set of parts, noting relative locations between parts, appearance of each part
    - [Fischler and Elschlager, 1973](http://dl.acm.org/citation.cfm?id=1309318)
    - Constellation model: recognizing larger parts in objects
    - Discriminatively-trained PBM: eg. HoG ([Ramanan, PAMI 2009](https://cs.brown.edu/~pff/papers/lsvm-pami.pdf)); can recognize object even if whole object can’t be seen
  • Present ideas: global+local features, context-based, deep learning

Object Detection with Deep Neural Networks

  • ImageNet: excellent competition for image classification

  • AlexNet (2012) won in 2012, changed the game

  • CNN: Local spatial information is transmitted through the layers; fine-level information in channels, location-level in neurons themselves

  • Convolutional layers are translation-equivariant: shift the input and the feature maps shift with it

  • Responses from the layers are composed: one layer fires when it sees a head or a shoulder, the next layer fires when it sees both, etc.

  • Convolutional layers can work with any image size; only the size of the output feature maps scales accordingly

  • HoG by Convolutional Layers:

    • HoG: Compute image gradients -> Bin gradients into 18 directions -> Compute cell histograms -> Normalize cell histograms
    • CNN: Learn edge filters -> Apply directional filters + gating (non-linearity) -> Sum/Average Pool -> LRN
    • So the same steps of HoG can be equivalently produced by a CNN, and even better because CNN can decide the bins, and other steps that were hitherto hand-engineered
  • HoG, Dense SIFT, and many other “hand-engineered” features are convolutional feature maps

  • Features in feature maps (one channel of any convolutional layer) can be back-tracked to a visual feature in the original image

  • What if we apply CNNs to object classification + localization?

CLASSIFICATION + LOCALIZATION

  • We could slide a window on the original image, input each image into, say, AlexNet, and find the window with max firing of object, say cat

  • But Sliding Window approach in this case is computationally expensive, because too many windows (Viola-Jones, etc. are not computationally expensive)

  • Localization as a regression problem:

    • Using AlexNet, in addition to a last layer to predict the object class, use another last layer to predict the 4 numbers of a bounding box
    • Total loss = Softmax loss of object class label + L2 loss of bounding box
  • But this fails when there are multiple objects, or multiple instances of the same object!

  • So let’s try to reduce the number of windows using Region Proposals

    • Extract Region Proposals -> Resize each region to standard size -> Find Convolutional features -> Classify each region (maybe using SVM), and do linear regression for bounding box within proposed region
    • Training for R-CNN is not very accurate at first, and it is very slow
    • Ad-hoc training objectives (use an AlexNet pre-trained with ImageNet):
      • Fine-tune network with softmax classifier (log loss)
      • Train post-hoc linear SVMs (hinge loss)
      • Train post-hoc bounding box regressing (least squares)
    • But, this takes a looooong time to train, and a long time to infer (the class and bounding box)
    • Instead of cropping the original image by the proposed regions and passing them through the network, pass the entire image, and then crop the extracted feature (say, at Conv5, the last convolutional layer in AlexNet) according to the region proposals!
    • But the extracted cropped-ish feature (feature cropping is not advisable, features are different from images) is of variable size depending on the size of the proposed region
    • So use Bag-of-Words and Spatial Pyramid Matching (SPM) to extract a uniform-sized vector at the end of the convolutional layers to input to the Dense layers
    • SPM: Make grids with pyramidally increasing size, pool the features into the grids, use these grids as the input to the Dense layers
    • This is what SPP Net [He et al., ECCV14] did
    • So SPP Net fixes the inference speed issue with R-CNN
    • But, we cannot pass gradients through the Spatial Pyramids! So link is broken, backprop can’t be used
    • Instead of spatial pyramids, use only 1 scale, and do ROI Pooling
    • The ROI Pooling layer is differentiable, so gradients can pass through it (a naive sketch appears at the end of this section)
    • Hierarchical Sampling: choose shuffled ROIs from the SAME image in a minibatch while running SGD; while back-propagating, the weights corresponding to regions outside the ROI also get updated, so it is better to update weights from ROIs of the same image
  • Also, in case of Fast R-CNN/SPP Net, the features at Conv5 have information from the surrounding area as well, while those in R-CNN don’t. This is a good thing, since the surrounding area provides context.

  • So time is fine, all that’s left is Region Proposals

    • Anchors: Pre-defined reference boxes - different aspect ratios, and different scales
    • Propose regions from the feature map instead of the image! Using a separate convolutional network called Region Proposal Network for this
    • At each pixel, generate suggestions for box coordinates (x, y, w, h), based on the anchors, such that an object is present within that box; use IoU to select and reject boxes
    • Extract the most probable boxes containing objects using the Region Proposal Network
    • Use THESE proposed regions, do ROI Pooling and the rest to extract objects and their bounding boxes
    • Divide the image into 7x7 cells
    • At each cell, use a single network to generate 2 bounding boxes (based on anchors) and class probabilities (instead of an RPN like in Faster R-CNN)
    • Use the ground truth bounding box to increase the probability of the bounding box closer to it, and decrease that of the other one
    • Use Non-Maximal Suppression to eliminate boxes
    • But, multi-scale is not taken care of
    • Multi-scale feature maps
    • Data augmentation
    • Also, while training, assign GT box to all unassigned generated boxes with IoU > 0.5
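
A naive PyTorch sketch of the ROI Pooling idea referenced above: crop each proposed region out of the convolutional feature map and pool it to a fixed 7x7 grid so the fully connected layers always see the same-sized input. Coordinates are assumed to be already in feature-map units; real implementations handle image-to-feature scaling and bin alignment more carefully:

```python
import torch
import torch.nn.functional as F

def naive_roi_pool(feature_map, rois, output_size=(7, 7)):
    """feature_map: [C, H, W]; rois: list of (x1, y1, x2, y2) in feature-map coordinates."""
    pooled = []
    for (x1, y1, x2, y2) in rois:
        region = feature_map[:, y1:y2, x1:x2]                       # crop the proposal from the features
        pooled.append(F.adaptive_max_pool2d(region, output_size))   # fixed-size output per ROI
    return torch.stack(pooled)                                      # [num_rois, C, 7, 7]

feats = torch.randn(256, 32, 32)                                    # e.g. a Conv5-like feature map
rois = [(0, 0, 16, 16), (8, 4, 30, 28)]
print(naive_roi_pool(feats, rois).shape)                            # [2, 256, 7, 7]
```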

Vineeth Balasubramanian (IIT Hyderabad) - GANs and VAEs (13:20 to 16:40)

Introduction to Generative Models

  • Recognizing objects is fine, a harder problem is to imagine!

  • Eg. Handwriting generation, colorize b/w

  • Discriminative models: anything that tries to draw a boundary

  • Generative models: HMM, etc. - assume the data is generated from an underlying distribution (e.g. a Gaussian) and learn the parameters (mean and variance) of that model

GANs

  • Developed by Ian Goodfellow in 2014: https://arxiv.org/abs/1406.2661

  • Generator: Noise -> Sample

  • Discriminator: Binary classifier - is input data REAL or FAKE?

  • DCGAN (Radford et al, 2015) uses CNNs as Generator and Discriminator

  • Generator and Discriminator play against each other, the Generator trying to cheat the Discriminator, and the Discriminator getting better at catching the Generator

  • Generator G tries to: a) maximize D(G(z)), i.e. minimize log(1 - D(G(z)))

    • Since G has no influence on D(x), this is the same as G minimizing log D(x) + log(1 - D(G(z)))
  • Discriminator D tries to: a) maximize D(x) b) minimize D(G(z)), i.e. maximize log(1 - D(G(z)))

    • Combine a) and b): D maximizes log D(x) + log(1 - D(G(z)))
  • So, min_G max_D [log D(x) + log(1 - D(G(z)))] (the same value function for both players; a training-loop sketch follows the pitfalls below)

  • Loss in a GAN never comes down, it keeps oscillating. Because there are two players playing against each other.

  • We don’t have a good metric to know when to stop training. If it looks good to your eyes, it’s probably time to stop.

  • Pseudocode from https://arxiv.org/abs/1406.2661

  • Maximizing likelihood = minimizing MSE (when the likelihood is Gaussian)

  • Illustration of G imitating the real distribution

  • Pitfalls of GAN: No indicator when to finish training, Oscillation, Mode collapsing
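
A compressed PyTorch sketch of one GAN training iteration following the minimax objective above. G and D are assumed to be any generator/discriminator networks (e.g. the DCGAN ones), D is assumed to end in a sigmoid, and the common non-saturating trick is used: G is trained to maximize log D(G(z)) via flipped labels:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()                      # assumes D outputs a probability in (0, 1)
real_label, fake_label = 1.0, 0.0

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    b = real.size(0)
    z = torch.randn(b, z_dim)

    # --- Discriminator: maximize log D(x) + log(1 - D(G(z))) ---
    opt_D.zero_grad()
    loss_D = bce(D(real), torch.full((b, 1), real_label)) + \
             bce(D(G(z).detach()), torch.full((b, 1), fake_label))
    loss_D.backward()
    opt_D.step()

    # --- Generator: maximize log D(G(z)) (non-saturating form) ---
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.full((b, 1), real_label))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```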

Hacks of DCGANs by Soumith Chintala

- Normalize image b/w -1 and 1
- Use Tanh
- Don’t sample from uniform, sample from Gaussian (spherical)
- Use BatchNorm. If BatchNorm is not an option, use InstanceNorm.
- Avoid sparse gradients
    - Stability of GAN suffers
    - Use LeakyReLU instead
- Use soft/noisy labels (0.7-1.2 instead of 1, 0.0-0.2 instead of 0)
- Occasionally flip the labels given to the discriminator
- Use SGD for D, and Adam for G

Variations of GANs

- [Vanilla GAN [Goodfellow et al., 2014]](https://arxiv.org/abs/1406.2661)
- [Conditional GAN [Mirza and Osindero, 2014]](https://arxiv.org/abs/1411.1784): Give the class label also to D and G. Perhaps avoids mode collapse.
- [Bidirectional GAN [Donahue et al., 2016]](https://arxiv.org/abs/1605.09782)
- [Semi-supervised GAN [Salimans et al, 2016]](https://arxiv.org/abs/1606.03498): Give class label to D while training, get the class label of the input image from D, in addition to REAL or FAKE
- [Info GAN [Chen et al., 2016]](https://arxiv.org/abs/1606.03657): Give class label only to G, get class label as well from D; can generate 3D faces with varying Azimuth (pose), Elevation, Lighting, Wide/Narrow
- [Auxiliary Classifier GAN [Odena et al., 2016]](https://arxiv.org/abs/1610.09585)
- <Man with glasses> - <Man without glasses> + <Woman without glasses> = <Woman with glasses>
- What’s interesting is we have mapped images to a vector space that is continuous and awesome enough to be able to do such vector operations
- Super-resolved blurred images
- Introduced Perceptual Loss = Content Loss + Adversarial Loss
- Updates manifold with user interaction
- Adds a loss with the manifold, in addition to Adversarial Loss
- Noise Vector -> Deconvolution to generate video for Foreground, to generate image for Background -> Combine foreground and background to make video

Resources for GANs:

- [List of GANs](https://github.com/nightrome/really-awesome-gan)
- [GAN zoo](https://github.com/hindupuravinash/the-gan-zoo)
- Ian Goodfellow

Qs: Generated images from uniform range? Why use Gaussian?

VAEs

  • Autoencoder is a network that tries to predict its input itself

  • The use is that the hidden layer can be smaller than the input, so a compressed representation can be learnt

  • If input is completely random, then this compression is very difficult

  • Denoising Autoencoder: during training, willfully add noise, and denoise it using the autoencoder.

  • MANIFOLD

    • A lower-dimensional representation of higher-dimensional data
    • ML experts try to find the lower-dimensional representation of just about everything
  • Deep Autoencoders - Hinton and Salakhutdinov, Science, 2006 is the paper that revolutionized Deep Learning.

    • People used to use RBMs with pre-training as an initialization
    • That’s when deep networks got noticed
    • Now, pre-training is largely redundant because we use better initializations (e.g. Xavier/Glorot init)
  • Probabilistic Graphical Models: P(x, z) = P(z) * P(x|z), if x is in the next layer of a network with input z

  • VAE loss function: KL Divergence + Reconstruction Loss

  • Reparametrization trick - introduced in 2014 - so the VAE can be trained with backpropagation (see the sketch below)
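
A condensed PyTorch sketch of the VAE pieces just mentioned: the reparametrization trick z = mu + sigma * eps, and the loss = reconstruction term + KL divergence to a unit Gaussian (the encoder/decoder are stand-in linear layers, sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps          # reparametrization: sampling stays differentiable
        return torch.sigmoid(self.dec(z)), mu, log_var

def vae_loss(x, recon, mu, log_var):
    rec = F.binary_cross_entropy(recon, x, reduction="sum")          # reconstruction loss
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())   # KL(q(z|x) || N(0, I))
    return rec + kl

x = torch.rand(16, 784)
recon, mu, log_var = TinyVAE()(x)
print(vae_loss(x, recon, mu, log_var))
```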

  • Attention in Deep Learning for Vision: RNNs for captioning - Show, Attend and Tell, CS231n lectures

  • (DRAW) Deep Recurrent Attentive Writer: generates images in phases; youtube

  • Attention Mechanism: Spatial Transformer Networks, youtube

    • Tx the squashed MNIST digit into a more regular form using the STN, then input that into any MNIST digit recognizer
  • Sync-DRAW with Captions - currently accepted at ACM

  • PhysNet - try to learn physical laws from images and predict them


20170705

Karteek Alahari (INRIA Grenoble) - Semantic Segmentation (09:00 - 10:30)

  • Working in INRIA Grenoble

  • Camvid (Brostow et al., PRL’08): possibly the first autonomous driving effort

  • Better is to do object classification (which object is present), and object detection (localizes objects to bounding boxes) (Some papers...)

  • Even better is Semantic Segmentation, to achieve pixel-level labelling (Some papers...)

  • Goal: pixels in, pixels out

  • Eg. Monocular depth estimation (Liu et al., 2015), boundary detection (Xie & Tu, 2015)

  • Long history: Leibe & Schiele, 2003; He et al., 2004, ...

  • More recently: Deeplab [Chen et al., 2015], Long et al, [Pathak et al], CRF as RNN

Higher order CRF

  • Pose the problem of Semantic Segmentation onto a graph: let each pixel be represented as a node on a graph (4 or 8 neighbourhood), each pixel can carry a semantic label (road, or car, ...)

  • For each assignment, there is a cost called Unary Potential

    • Unary potential

    • 𝜓_i(x_i), cost of the class label
    • TextonBoost Shotton et al., ECCV 2006

    • Each feature is denoted by a pair: (rectangle r, texton t) == whether a region r contains the texton t
    • Texton: convolve with a set of filters (Paper used Gabor filters), cluster them (pixel-wise?) to make a Texton Map
    • Feature response: count of texton instances
  • Better: Use feature type as another element - (rectangle r, feature f, texton t), where f belongs to {SIFT, HOG, etc.}

  • Comparison of Brostow et al.'s method and the Unary Potential: we see that using the Unary Potential performs better

    • Pair-Wise Potential

    • 𝜓_{i,j}(x_i, x_j), cost of adjacent pixels not having the same label
    • Contrast-Sensitive Potts model: taking adjacency into account
  • Texton Boost: Unary + Pairwise!

  • Better results with +Pairwise than just unary

    • Segment-based Potentials

  • Single segmentation: meanshift?

    • Very good at making superpixels
    • Not very good at fine-level segmentation
  • Combine multiple segmentations: combine multiple levels of sensitivity in segmenting an image

  • To do this, we introduce Clique Potential

    • Clique Potential

    • 𝜓_c(x_c), cost of a clique/superpixel
    • Cliques shall have higher cost if they have multiple labels in them
    • This encourages label consistency within a clique
    • One version: Robust $P^{N}$ model - 𝜓_c(x_c) = N_i(x_c)*(1/Q)*𝛾_max if N_i(x_c)<=Q, 𝛾_max otherwise
    • 𝛾_max is label inconsistency
  • Even better results with +HO than just Unary + Pairwise

  • Unary + Pairwise + HO: BMVC ‘09

  • SO FAR: what objects are in the scene, where are the objects; what about, how many objects?

  • Also, the HO result has missed thin objects

  • Maybe one way is to do object detection and count the number of object instances

Detector-driven Segmentation

  • Imposes hard constraint, cannot recover from false detections

    • Detector Potential

    • Detector potential = min_{y_d}(Strength of detection hypothesis + inconsistency cost (e.g., for occluded objects)), where y_d is all the possible segmentations
  • We need a smarter way of computing y_d

  • So: Alpha expansion, Graph cuts make the cost function simpler

  • Using Detector Potential is able to combine multiple sliding window detections to eliminate some boxes, and extract thin objects

  • Comparison without and with combined sliding window detectors in PASCAL VOC 2009

  • SO FAR: what objects, where are the objects, how many objects - through classical CRF-based methods

  • New competition in CVPR 2017: Make PASCAL Great Again (PASCAL VOC ended in 2012)

  • DPM and its improvements were used through 2007-2012 on PASCAL VOC, to get to 40% mAP

  • Post-competition, in 2013, Regionlets were used to jump up to 40% mAP directly

  • Also, in the previous methods, Unary Potential was learnt through supervised learning, Clique Potential was unsupervised

  • CNNs (obviously) changed the game

  • CNN performs object classification, R-CNN does object detection; how to adapt for Semantic (pixel-level) Segmentation?

Semantic Segmentation using Fully Convolutional Networks

  • CONVOLUTIONALIZE: First up, convert CNNs + Fully connected -> Fully convolutional

  • Use AlexNet, VGG, but replace the FC layers with CNNs

  • In the second part, upscale the layers to get back a layer with image-size full resolution

  • Append 1x1 convolutions with channel predictions

  • Combining several scales:

    • combine where (local, shallow) with what (global, deep): fuse the features from different levels into a “deep jet”
    • use skip layers, skipping with stride (comparing 32, 16, 8, best is with 8 stride)
  • Thus, pixel-level segmentation was achieved,

  • But this required pixel-level ground truth for training

  • Can we use weaker forms of supervision? Maybe bounding boxes, or just text tags ("cat", "dog")

Weakly-supervised methods

  • MIL-based [Pathak et al., ICRWL’15]

  • Image-level aggregation [Pinheiro & Collobert, CVPR’15]

  • Constraint-based [Papandreou et al., ICCV’15; Pathak et al., ICCV’15] (e.g.: at least p% must have that label)

  • Papandreou et al.: Not very good with weak supervision using p% constraint (GT bird -> predicted bird+plane example)

Karteek Alahari (INRIA Grenoble) - Semantic Segmentation (13:30 - 14:30)

  • Papazoglou et al., ICCV 2013

  • EM-Adapt Papandreou et al., 2015

  • M-CNN Tokmakov et al., ECCV 2016

    • Weakly-supervised semantic segmentation with motion cues
    • Video + Label -> FCNN -> Category appearance, Motion segmentation -> GMM -> Foreground appearance, (Category, foreground) -> Graph-based inference -> Inference labels
    • Better than Papazoglou et al.’s, better than EM-Adapt
    • Fine-tuning by re-training with intersection of outputs of EM-Adapt & our M-CNN
    • Pathak ICLR, Pathak ICCV and Papandreou use much more data, but we achieve higher accuracy with weak supervision (to be fair, theirs is pure weak supervision)
  • Of course, now there are better standards to compare with

LEARNING MOTION PATTERNS IN VIDEO [ArXiv Tech. rep. 2016]

  • FlyingThings dataset, Mayer et al., CVPR 2016: synthetic videos of objects in motion

  • Summary of motion estimation, video segmentation

MP-Net (Encoder-Decoder Network)

- Optical flow -> Encoder -> Decoder -> Objects in motion
- Encoder: allows a large spatial receptive field
- Decoder: output at full resolution
- Image from FlyingThings, Ground Truth optical flow -> Motion segmentation

DAVIS Challenge [Perazzi et al., CVPR 2016] (Densely Annotated VIdeo Segmentation dataset)

- Image -> Estimated Optical Flow (LDOF) -> Motion segmentation
  • Optical flow can be computed using CNNs

  • Try to capture what the “object” in the scene is

  • Combine MP-Net prediction with “object-ness” to get better prediction (as a sort of post-processing)

  • We can refine segmentation using a Fully-connected CRF [Krahenbuhl and Koltun, 2011]

    • Unary score + colour-based pairwise score
  • Evaluation datasets:

    • FT3D (FlyingThings): 450 synthetic test videos, use ground truth flow
    • DAVIS: 50 videos
    • BMS (Berkeley Motion Segmentation): 16 real sequences corresponding to objects in motion
- Mean field inference iteration as a stack of CNN layers

Gaurav Sharma (IIT Kanpur) - Face and Action (11:00 - 12:30)

FACES

  • Motivation for studying faces, cameras, etc.

Face Recognition

- compute a basis set of faces, represent faces as weighted combination of basis faces
- So instead of manually specifying the length of the nose, etc., the set of weights representing a face does the same automatically
- Now that faces can be represented as a vector of weights, one can apply standard classification algorithms like Nearest Neighbours, etc.
- Local Binary Patterns (LBP): make 256 binary patterns (local neighbourhoods thresholded to make binary codes), run over the image, make a histogram of the occurrences of each of the 256 patterns within the image
- Image is divided into grids, and histograms are computed for each grid, and fused
- Then use SVM, etc., to classify
  • Other methods use the same sequence but using SIFT, SURF, etc. features

Face Identity Verification

- Check if two faces are of the same person or not, doesn’t matter what the name of the person is
- Applications: image retrieval (indexing large video collections), clustering (e.g. for visualization), and maintaining privacy
- Challenges: large amount of data, we need ~40TB of RAM to do this at world scale

Distance Metric Learning

- To extract distance as a measure of similarity in semantics, we need to train distance metrics
- Use Mahalanobis-like distance: (x_i - x_j)^T * M * (x_i - x_j)
- Different supervision methods: class supervision, pairwise friend/foe supervision (hard) relative triplet constraints (soft)
- Discriminative and dimension-reducing embedding: M = L^T * L; distance = (x_i - x_j)^T * M * (x_i - x_j) = (x_i - x_j)^T * L^T * L * (x_i - x_j) = ||L*x_i - L*x_j||^{2}_{2}
  • But most of these assume linear embedding
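
A tiny numpy illustration of the learned-metric distance above: once a projection L is available (however it was learned; here it is random, purely for illustration), the Mahalanobis-like distance with M = L^T L is just a squared Euclidean distance in the projected space:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 32
L = rng.normal(size=(k, d))          # learned projection (random stand-in)
M = L.T @ L                          # the induced Mahalanobis-like metric

x_i, x_j = rng.normal(size=d), rng.normal(size=d)

quad_form = (x_i - x_j) @ M @ (x_i - x_j)             # (x_i - x_j)^T M (x_i - x_j)
projected = np.sum((L @ x_i - L @ x_j) ** 2)          # ||L x_i - L x_j||_2^2

print(np.allclose(quad_form, projected))              # True: the two forms agree
```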

Non-linear Embeddings

- [Taigman et al., DeepFace, CVPR 2014](https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf): Use Deep CNNs as an embedding (instead of a linear embedding)
- Proposed by Facebook, used Social Face Classification dataset (~4.4M images, ~4k identities) (Facebook proprietary)
- After training, remove the final classification layer, use the last FC layer (after normalization) as features

Siamese Network

- [S. Chopra et al., Learning a similarity metric discriminatively, with application to face recognition, CVPR 2005](http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf)
- Use a siamese network to say if two images belong to the same person or not
  • Labeled Faces in the Wild (LFW) 2007 dataset

    • 13k images of faces of 4k celebrities
    • same, not-same pairs
  • DeepFace Ensemble came close to human level verification (~97%) (should be taken with a grain of salt) (by Facebook, 2014)

VGG Face

- By Oxford, in 2015
- Semi-automatic creation of large publicly available dataset  - 2.6M images, 2.6k identities (weakly made verification, then human annotated, possibly in Hyderabad)
- Used Triplet Loss (see the sketch after this section), and an adaptive objective, achieved comparable results to DeepID and FaceNet
  • But, the problem is not solved
    • Compression Loss: all images are being compressed to save memory
    • Large scale: in a large dataset, it is highly possible to find a different face with similar illumination/pose; without distractors 97% accuracy, with distractors 70% accuracy
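
A short PyTorch sketch of the triplet loss used in this style of face-embedding training: pull an anchor towards a positive (same identity) and push it away from a negative (different identity) by at least a margin (the embeddings here are random stand-ins):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2, p=2)

# Stand-in embeddings; in practice these come out of the face CNN
anchor   = torch.randn(32, 128, requires_grad=True)   # anchor faces
positive = torch.randn(32, 128)                       # same identities as the anchors
negative = torch.randn(32, 128)                       # different identities

loss = triplet(anchor, positive, negative)  # hinge on d(a, p) - d(a, n) + margin
loss.backward()                             # gradients flow back into the embedding network
print(loss.item())
```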

Age Estimation

- [Liu et al., AgeNet: Deeply Learned Regressor and Classifier for Robust Apparent Age Estimation, CVPRW 2015](www.jdl.ac.cn/doc/2011/201611814324881700_2015_iccvw_agenet.pdf)
- Task is to find the human estimation of _apparent_ age, not the real age
- Pre-train a multi-class Face Classification network -> Fine-tune with Real Age from, say, passport data -> Fine-tune with Apparent Age

ACTION

- Find interest points that change in both space and time
- Use BoW on spatio-temporal interest points
- Use motion wisely to figure out trajectories (within 15-20 frames, mostly <1sec), use information around trajectories to suppress background trajectories (like camera motion)
- Dense sampling at multiple scales -> Tracking in each scale -> Feature description of trajectories (using HoG, HoF, MBH)
- These features play the same role as spatio-temporal interest points
- Classification can be done based on these features
- Can be used with other aggregation methods like Fisher encoding
- Use appearance + motion
- Input video -> Single frame -> 1 ConvNet for spatial stream, Multi-frame optical flow -> 1 ConvNet for temporal stream -> fuse both streams at the end
- Optical flow: [Brox et al., High accuracy optical flow estimation based on a theory for warping, ECCV 2004](http://www.mia.uni-saarland.de/Publications/brox-eccv04-of.pdf)
  • Previous standard - iDT (improved Dense Trajectories)

  • Then people thought convolution itself can be done in both spatial and temporal dimensions, so-

- 3D ConvNets as general video descriptors
- Train on large datasets
- C3D + iDT + Linear SVM
- To reduce actions into single images
- Dynamic image = RGB image to summarize video = Appearance + Dynamics
- Rank Pooling: pool frames from the video according to their rank, but it is not differentiable
- Dynamic Images are more suitable to dynamic actions (push-ups, etc.), while RGB images are more suitable to static actions (playing the piano, etc.)
- [Suleyman and Zisserman, the Kinetics Dataset](https://arxiv.org/abs/1705.06950)
- [Carreira and Zisserman, Quo Vadis, Action Recognition? CVPR 2017](https://arxiv.org/abs/1705.07750)
- The nail in the coffin
- Data was the bottleneck - proposed Kinetics Dataset
- Convert 2D ConvNets into 3D, pre-trained as 2D and repeated in time
- Huge jump in UCF-101 (98%) and HMDB-51 (80%) datasets accuracy with pre-training on Kinetics dataset

Gaurav Sharma (IIT Kanpur) - Face and Action (14:30 - 15:30)

  • (x_i - x_j)^T * L^T * L * (x_i - x_j) = ||(Lx_i - Lx_j)||^{2}_{2}

  • We use L to project our input space into a space with better distance metrics for the semantics that matter, i.e. L*x_i

  • Heterogeneous setting: some images have identity, some other images have tags, etc.

PROPOSED METHOD

  • Distance = Distance_common_across_tasks + Distance_specific_to_task

  • During training, learn all tasks together -> update common projection for all tasks -> update projection for specific task

  • Experimented with large datasets:

    • LFW
    • SECULAR - took images from Flickr, so hopefully no overlap with celebrity faces in LFW
  • Comparable methods: WPCA, stML (single task), utML (union of tasks)

  • Identity-based retrieval: (Main task, Auxiliary task)=(Identity, Age)

  • Age-based retrieval: (Age, Identity)

  • Also added expression information

  • Adaptive LOMo

  • Adaptive Scan (AdaScan) Pooling - CVPR 2017

20170706

Vineeth Balasubramanian - Visualizing, Understanding and Exploring CNNs (09:30 to 11:10)

  • UFLDL: tutorials on deep learning

  • Deep learning is awesome because, not much manual design of weights

  • Understanding CNNs: visualize patches, visualize weights, etc.

Visualize patches:

-  Visualize patches that maximally activate neurons
- what pattern/texture caused this particular neuron to fire?
- [Rich feature hierarchies..., Malik et al., 2013](https://arxiv.org/abs/1311.2524)

Visualize the weights:

- the weights are the filter kernels
- Some look like Gabor filters (a sinusoid modulated by a Gaussian envelope)

Visualize the representation space (e.g. with t-SNE)

- t-SNE visualization [van der Maaten and Hinton, "Visualizing Data using t-SNE", Journal of Machine Learning Research, 2008](http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
- t-SNE, IsoMaps… these are non-linear dimensionality reduction techniques

Occlusion experiments:

- Occlusion experiments, in [Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, 2013](https://arxiv.org/abs/1311.2901)
- They put grey boxes in random places in an image, forward passed it, and found in which cases was the right class predicted, so as to understand what in the original image caused the right classification

Deconv approaches:

- Start from a neuron of our choice to find out what it learns:
    1) Feed an image into a trained net
    2) Set the gradient of that neuron to 1, and the rest of the neurons in that layer to 0
    3) Backprop to image
- Another way is Guided Backpropagation, from [Striving for Simplicity, Springenberg et al., 2015](https://arxiv.org/abs/1412.6806):
    - While backpropagating, don't just zero the gradients at units whose forward activations were negative (the ReLU rule); also zero the gradients that are themselves negative
    - This gives better visualizations than plain Deconv

Optimization to Image

Fooling Neural Networks

Explaining CNNs: Class Activation Mapping

  • After Conv layers, make a layer of the Global Average Pooled values of each channel, and make a classification layer with weights w1, w2, etc. corresponding to the GAP values

  • Using the weights, compute w1*c1 + w2*c2 + ... , where c1, c2, … are the original channels (feature maps), not the GAP values (see the sketch at the end of this section)

  • The resultant map is called Class Activation Map (CAM)

  • This can tell us what area of the original image the classifier was focussing on to predict the right class

  • Grad-CAM:

    • CAM required re-training the network. Grad-CAM avoids this
  • Guided Grad-CAM:

    • Use Grad-CAM with Guided Backpropagation
  • Results of Guided Backpropagation, Grad-CAM, Guided Grad-CAM
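
A small sketch of the CAM computation described above, in PyTorch-style tensor code: weight each channel of the last convolutional feature maps by the classifier weight for the chosen class and sum them (the feature maps and weights here are random stand-ins):

```python
import torch
import torch.nn.functional as F

feature_maps = torch.randn(512, 14, 14)    # channels c1, c2, ... from the last conv layer
fc_weights = torch.randn(1000, 512)        # classifier weights on top of the GAP values
class_idx = 283                            # the predicted class we want to explain

w = fc_weights[class_idx]                              # w1, w2, ... for this class
cam = torch.einsum("c,chw->hw", w, feature_maps)       # w1*c1 + w2*c2 + ...
cam = F.relu(cam)                                      # keep only positively contributing regions
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Upsample to the input resolution to overlay on the image
heatmap = F.interpolate(cam[None, None], size=(224, 224), mode="bilinear", align_corners=False)[0, 0]
print(heatmap.shape)   # [224, 224]
```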

Karteek Alahari - Pose and Segmentation (11:40 to 12:30)

  • Tokmakov et al., 2017

  • Video -> Optical flow->Motion network, Appearance network -> Classifier

STEREO

  • Pre-CNN era overview:

    • Disparity between L and R views -> Depth cues
    • RGB -> Motion cues
    • Disparity + RGB -> Person detections
    • Person detection -> Pose estimation -> Pose masks
    • Depth cues + Motion cues + Pose masks -> Segmentation
  • Problem: given the disparity between L and R:

    • estimate Pose (𝚯)
    • estimate pixel label (person)
    • estimate disparity parameters (𝛕): layers, the layered ordering of people
  • Computation of this over all possible values is an NP-hard problem to solve

  • So instead, Energy function = Unary term + Spatial Pair-wise energy term + Temporal Pair-wise energy term

  • Spatial Pair-wise term = Disparity smoothness + Motion smoothness + Colour smoothness

Narendra Ahuja - Bird Watching

  • Bird watching is a low-SNR, small-object detection problem

Bird watching:

- Stereo camera rectification of two images
    - Rectify using star points from RANSAC
- Foreground detection
    - Now that we have aligned the stars, leave the stars behind, and only see the birds, or planes, etc.
- Geometry verification
    - Improve SNR by seeing what remains consistent, or which is not like cosmic noise
- Trajectory estimation
    - Now that we have bird points, compute the trajectory

Image Super Resolution

PANEL DISCUSSION

Moderator: For a PhD student in CV today, how to choose problem or approach to work on?

Moderator: How is research different in the internet-era today?

Moderator: How to do cutting-edge research in India?

Moderator: What advantages does India have?

Gaurav Sharma, IIT Kanpur

  • Trivial solution: If you already like a problem, no issues
  • If only generally interested: scan conferences, pick a pile of papers, shortlist some topics
  • How about finding a problem which hasn’t been solved yet? Discuss with advisor.

Karteek Alahari

  • Most important: the problem has to speak to you, YOU need to be interested in it
  • Advantage with CV: many problems to choose from
  • There are several textual resources, many visionaries and big names give talks in Key Notes, etc., so ideas are available everywhere

Qualcomm

  • Probably IIIT-H is awesome
  • Go for problems that have real-world value
  • Look at solutions in non-CV areas
  • If there is no other way one can solve from a non-CV formulation, then only go for a CV-based solution

Moderator: Is research a race now?

Karteek Alahari

  • It is a race, but every person has the potential to produce their own solution to any problem, so there is potential to produce a variety of solutions

Gaurav

  • Yes, it is a race, but it is an advantage
  • It is a personal choice
  • Go to conferences, talk to people

Moderator

  • Karteek is right, the problem should speak to you
  • But, how do you approach it? So many advances so fast
  • One needs to keep up with the advances
  • An individual just starting might not get the nuances that a Research Group has learnt
  • The tricks of the trade matter
  • Don’t believe everything in a paper, not that they are lying
  • Use the internet, email the authors, etc. if you find discrepancies with theory in paper and implementation

Narendra Ahuja - UIUC, ITRA

  • A tuning fork resonates at a certain frequency, a nearby utensil will also. Society will prosper if its citizens are attuned to its problems
  • Whole day, we deal with problems - diseases, environment, etc.
  • If we are attuned to these problems, and we have empathy, it is natural to want to solve these problems
  • Many people have many talents. Whatever springs up in you in trying to solve this problem using your own talents is automatically good for the society
  • Approach is subservient to the problem
  • Internet - we are much more capable of solving problems now. We don’t have to struggle for data, Data is pouring in, there’s IoT, etc.
  • There’s no way what you do is not cutting-edge
  • Solve a problem within your own bubble, it’s ok
  • Advantage in India - plenty of problems!
  • Power, doability are not issues, attune yourself
  • Institutions (like IIIT-H) need to bring people back to life, not just worry about next job, next car, etc.

Moderator: But, like, Autonomous driving, India is the wrong place to start, right? How to balance extra-challenging problems outside of the society, vs problems that need to be solved right now?

Narendra

  • Who is asking you to solve autonomous driving?
  • All of us can agree that we don’t need to pick only from the top of the pile

Qualcomm

  • Agreed, plenty of data, plenty of problems to solve
  • Changes: more grants, etc
  • More importantly, you can build on other people’s work. You can solve one problem, and there’ll be a completely different group that can take it forward

Moderator: advantage of India?

Karteek

  • Biggest advantage with India: people people people!
  • We need annotations for fully supervised, lots of people to get into research, to help out

Gaurav

  • Agree with Narendra about local Indian problems
  • Disadvantage with India: we are not making products
  • In the US, small contractors collaborate with institutions and act as middlemen for industry. That ecosystem is a challenge here.
  • To get into ground zero of research is difficult here
  • Advantage: lot of people

Moderator: Internet era?

Gaurav

  • Personally, stressful. Things come up very fast. Start with a problem, in 6 months arXiv has a paper.
  • But there are only a few groups doing this. Because there is so much info, you can predict where their next paper is going to be.

Audience: Did you tackle a domain shift due to deep learning? For e.g., a 4th year PhD guy realized all his past work was worthless because deep learning can solve so much?

Moderator: You’re talking about 4th year PhDs, what about faculty??

Qualcomm

  • You’re learning to dig a hole.
  • It is guaranteed that what you’re working on today is not what you’re going to work on 20 years from now
  • I’ve changed areas 4 times already

Gaurav

  • Advice: get used to it.
  • Adaptability is key

Moderator: yeah, keep changing

Audience: India has so many people, next ImageNet? Can we use the people?

Gaurav

  • You pay them.
  • Also, you give specific instructions
  • VGG-Face dataset was annotated in Hyderabad.

Karteek

  • There has to be some sort of reward
  • There has to be a coordinator

Qualcomm

  • Maybe crowdsource it?

Gaurav

  • Datasets are very planned, done by professionals, not students (not to discourage you)

Audience:

Gaurav

  • The tricks of the trade need to be picked up
  • You have to push through

Moderator: Take help from other groups

Karteek

  • Collaboration is key

Audience: As a PhD student, how to balance about low-level (code, etc.) and high-level (work in a larger picture) details simultaneously? Also, does the distribution of time change?

Karteek

  • Yes, time distribution does change
  • It’s not always trivial to strike this balance, there will be times when you get bogged down
  • Keep talking to your advisor
  • Rely on others for help

Qualcomm

  • IP landscaping - first piece in investment

Gaurav

  • Time distribution changing is natural
  • People will observe that
  • Important to keep yourself motivated

Moderator

    1) Advisor gives problem and solution, you execute; 2) Advisor gives problem, you solve it; 3) You figure out both the problem and the solution

Audience: What if someone makes a better method just before you were about to submit your paper?

Gaurav

  • If only I had a penny for every time that happened

Karteek

  • Don’t compare with arXiv, arXiv is not an accepted paper
  • Always, there will be some difference between your paper and theirs

Jawahar

  • Be happy that happened because great minds think alike

Gaurav

  • 3 aspects of any paper - new problem, new solution, new perspective
  • Try digging into that

Audience: Deep learning solves everything. So..??

Karteek

  • There’s still a lot to be done.
  • Constraints, Weak supervision, …
  • There is a lot of scope, we can do a lot more

Gaurav

  • Same abstract answer
  • Branch out, try to figure out the next steps

Audience: Metric to compare PhD students

Karteek

  • Similar to comparing people for any job
  • “What is the contribution you have made to solving any problem?”
  • Can it stand the test of time?
  • CV 10-year award; maybe it’s not something we can evaluate today, but maybe some years down the line

Moderator

  • Readiness to go forward is the real measure. How to go about that is a question though...
  • Erstwhile awesome papers now are nowhere close to winning 10-years awards

Gaurav

  • Can’t compare people, can compare papers
  • France prefers mathematics, America prefers ideas

Audience: What if you’re stuck during your PhD?

Karteek

  • There’s so much scope

Gaurav

  • If you’re feeling stuck, go broad

Moderator

  • PhD is not about finding a solution, it is about exploring the landscape of the problem
  • That is the way to go about a PhD

Random Qualcomm guy

  • You don’t have to be first to make an impact (Qualcomm < Intel, Google < Yahoo)
  • Adaptability - it’s not a zero-sum game
  • Like Karteek said, the problem should speak to you

Audience: What about a recent B.Tech.? How to choose what?

Moderator

  • Your heart

Gaurav

  • Something you liked - your senior worked on it, you read something, something clicks

Moderator

  • Like Prof. Ahuja said, some problem you yourself feel like solving, not just for money
  • It’s not about the actual subject, it’s the skills you pick up on the way

20170707

Venkatesh Babu (IISc) - Vision and Language (09:00 to 10:30)

  • Image is non-structural; text is highly structured, sensitive to structure

  • word2vec

  • Late fusion

    • Extract features from the image and the text, concatenate them at the last layer
    • Doesn't work very well
  • RNNs

  • LSTMs

Vinay Namboodiri, IIT Kanpur - Domain Adaptation (11:00 to 12:30)

  • Different types of learning: Pan and Yang, TKDE 2010

  • Domain Adaptation:

    • If X comes from two "domains", we assume that the conditional probability P(y|X) is the same across domains, meaning the same input gets the same class label in either domain, but the marginal probability P(X) differs between domains
  • Meaning, it might be the same blue dress that we need to classify, but it might be a different perspective on the image

  • This is different from splitting training and testing data, because there we can pretty much assume that asymptotically we are sampling from both domains, whereas here we have no information about the marginal probability of the domain we haven't trained on

  • Let’s look at Pre-Deep Learning methods

SHALLOW Domain Adaptation Methods

Instance Re-weighting

- Take the instances, change the weights attached to each instance
- Maybe using Maximum Mean Discrepancy Loss
- TrAdaBoost method

Model Adaptation: Adaptive SVM

- Slightly perturb the classifier to better fit the small target domain instances
- Online re-weighting of classifier
  • But the next one gained more popularity

Feature Augmentation: Geodesic Map Kernels

- Geodesic Flow Kernel for Unsupervised Domain Adaptation [B. Gong et al., CVPR 2012]
- Map the geodesic flow between the subspaces (from principal components) of the source data and the target data on the Grassmann manifold
- But this method is pretty cumbersome

Feature Transformation: Subspace Alignment

- This directly aligns the source and target subspaces using a Transformation Matrix M
- M is learned by minimizing the Bregman divergence: F(M) = ||X_S * M - X_T||^{2}_{F}; M* = argmin_{M}(F(M))
- Worked best, among the classical approaches
- ICCV 2013
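
A numpy sketch of this subspace alignment idea under the usual formulation (PCA bases X_S and X_T with orthonormal columns, for which the minimizer is M = X_S^T X_T); dimensions and data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
Xs_data = rng.normal(size=(500, 64))          # source samples
Xt_data = rng.normal(size=(400, 64)) + 0.5    # target samples (shifted marginal)

def pca_basis(X, d=10):
    """Top-d principal directions as columns (an orthonormal basis of the subspace)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T                            # shape: (64, d)

Xs, Xt = pca_basis(Xs_data), pca_basis(Xt_data)

M = Xs.T @ Xt                                  # closed-form alignment: argmin ||Xs M - Xt||_F
aligned_source_proj = Xs @ M                   # aligned source subspace

# Project the data; train any classifier on the source features, test on the target features
source_feats = (Xs_data - Xs_data.mean(axis=0)) @ aligned_source_proj
target_feats = (Xt_data - Xt_data.mean(axis=0)) @ Xt
print(source_feats.shape, target_feats.shape)  # (500, 10) (400, 10)
```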

Dictionary Learning

- Learn a common subspace, a Shared Dictionary, that can minimize the distance between the source and target points
- This dictionary is a Shared Discriminative Dictionary
- Then use a Reconstruction Error-based classification
- CVPR 2013

DEEP Domain Adaptation Methods

Fine Tuning

- Freeze most layers, train the last couple of layers
- But, we are assuming that we do have some supervision for the target domain within the source domain
  • What if there is no supervision in target domain?

  • We need to put an additional constraint about the closeness of the source and target domains

  • We want to design an NN such that the means of the activations of the source domain instances and the target domain instances are close to each other

Deep Adaptative Networks

- Kernel Mean Matching: re-weight the training points such that the means of the training and test points in a Reproducing Kernel Hilbert Space (RKHS) are close. How to do this using CNNs?
- Loss = CNN loss + MMD Regularizer
    - Here, the MMD regularizer is the RKHS distance between the mean embeddings of the source and target features
- Next paper: [Michael Jordan et al., 2015](https://arxiv.org/abs/1502.02791)
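
A simple PyTorch sketch of an MMD-style regularizer with a linear kernel, i.e. the squared distance between the mean feature embeddings of a source batch and a target batch (a full MMD would use an RBF or another characteristic kernel; this only gives the flavour):

```python
import torch

def linear_mmd(source_feats, target_feats):
    """||mean(phi(source)) - mean(phi(target))||^2 with phi = identity (linear kernel)."""
    return (source_feats.mean(dim=0) - target_feats.mean(dim=0)).pow(2).sum()

src = torch.randn(32, 256)          # features of a source minibatch (e.g. from a late fc layer)
tgt = torch.randn(32, 256) + 0.3    # features of a target minibatch

# total_loss = classification_loss_on_source + lam * linear_mmd(src, tgt)
print(linear_mmd(src, tgt))
```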

Deep Unsupervised Domain Adaptation

- Assume many labeled examples in source domain, not many in the target domain
- [Unsupervised Domain Adaptation by Backpropagation, Ganin and Lempitsky, ICML 2015](https://arxiv.org/abs/1409.7495)
- Network: Input -> Feature extractor -> Label predictor (Classifier)
- Right now, source sample features are well separated according to their class, but target samples are not; meaning target samples won't be classified properly, while source samples would be classified very well
- We want to extract features where both the source and target samples are mixed up, meaning the source and target features are indistinguishable, implying that classification of such features would be equally good/bad for both source and target samples
- So we add another branch from the Feature Extractor to classify whether a sample is coming from the source or target, and we want to train it adversarially so that it is not able to differentiate between a source and a target sample, implying their features are mixed up in the feature space
- Correct (according to class) mix-up shall simultaneously be taken care of by the Label Predictor branch
- To train adversarially, back-propagate the negative of the gradients from the Domain Classifier branch
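
A compact PyTorch sketch of the gradient-reversal trick described above: in the forward pass the layer is the identity, and in the backward pass it multiplies the gradient by a negative constant before it reaches the feature extractor (a common way to implement Ganin & Lempitsky's idea; details such as the lambda schedule are omitted):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip the sign of the gradient (no grad for lam)

features = torch.randn(16, 128, requires_grad=True)   # output of the feature extractor
reversed_feats = GradReverse.apply(features, 1.0)      # fed into the domain classifier branch

# The domain classifier minimizes its loss on reversed_feats; the negated gradient
# pushes the feature extractor to make source and target features indistinguishable.
domain_loss = reversed_feats.sum()                     # stand-in for a real domain loss
domain_loss.backward()
print(features.grad[0, :3])                            # gradients arrive with flipped sign
```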

Adversarial Discriminative Domain Adaptation

- Use separate CNNs for source and target
- Pre-train only Source CNN, adversarially train both, test with Target CNN

Venkatesh Babu (IISc) - Vision and Language (09:00 to 10:30)

LSTMS

  • LSTMs have a cell state that is controlled by 3 gates: the forget, input and output gates

Vinay Namboodiri, IIT Kanpur - Domain Adaptation - II (11:00 to 12:30)

Domain Adaptation for Detection

  • The challenge is that one does not know whether an object is present, and if it is present, where it is

  • We want to align the subspaces of the bounding boxes

  • Our approach:

    • Train R-CNN detector on source subspace from GT bounding box
    • Obtain bounding boxes on source by using detector (avoid non-maxima suppression)
      • This works better than the GT bounding boxes
    • Obtain predicted target bounding boxes using detector trained on source
    • Learn subspace alignment using the source and target bounding boxes
    • Project source samples onto target subspace using aligned subspace
    • Train the R-CNN detector again with the source samples projected onto the target subspace

LSDA

MMD regularization

20170708

Dhruv Mahajan - New Architectures in Deep Learning (09:00 to 12:30)

- They showed that Dropout could be replaced with this
- But deeper networks were observed to perform worse than shallower ones, so:
- They attributed the worse performance of deeper networks to vanishing gradients
- They added skip connections (ResNets) to eliminate this
  • [Huang et al., 2016] observed that after training, removing a block hardly affects performance - meaning there is a lot of redundancy in ResNets
- Dense block: concatenate a layer's output with the input of every subsequent layer (a minimal block sketch follows this list)
- Like a ResNet, cascade Dense blocks together
- Fewer parameters and FLOPs than ResNets for similar accuracy
- What if we could spend more time on harder images and less time on easier ones?
- Better results than even DenseNets
- What about Large Scale learning?
- Do a clustering at a layer, then pass (forward or backward) only through that path that belongs to the right cluster
- Hierarchical Softmax
- Differentiated Softmax
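
A minimal DenseNet-style dense block sketch, as referenced in the list above (growth rate, number of layers and the BN-ReLU-Conv ordering here are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # each layer sees all previous outputs
            features.append(out)
        return torch.cat(features, dim=1)
```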

Open problems

- Incremental addition of classes

CLASSIFICATION, DETECTION, SEGMENTATION - UNIFIED VIEW

+Detection:

  • R-CNN: apply CNNs to proposed regions

  • Fast R-CNN: apply CNN to full image, split the features according to regions

    • Isn't end-to-end: the external region proposals cannot be back-propagated through
  • Faster R-CNN: use a network to propose regions

    • Is end-to-end
  • But what about scale?

  • Feature Pyramid Networks [Lin et al., 2016]

    • Featurized Image Pyramid: Scale input image, pass different scale to CNN: computationally prohibitive
    • Single Feature Map: pass the image through the CNN and predict from the final feature map only: prediction accuracy does not increase much
    • Pyramidal feature hierarchy: predict from the feature map at each scale
    • FEATURE PYRAMID: add a top-down pathway that upsamples the coarser features and merges them (element-wise addition in the paper) with lateral connections from the finer features, then predict at each scale (see the merge sketch after this list)
  • Fully Convolutional Networks [Long et al., 2015]

    • Replace fc layers with conv layers
    • Introduce deconv layers for per-pixel prediction
  • PixelNet [Bansal et al., 2017]

    • Neighbouring pixels are highly correlated, which breaks the SGD IID assumption
    • So pick a sparse set of random pixels from different images in a batch, track each pixel through the layers, concatenate that pixel's features across layers, and use these for SGD
    • This does not break the IID assumption, and gets better results
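
A rough sketch of the FPN-style top-down merge referenced above: a 1x1 lateral convolution on the finer backbone level is summed with the upsampled coarser level, then smoothed by a 3x3 convolution (channel counts and the upsampling mode are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    def __init__(self, c_fine, c_coarse, out_ch=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_fine, out_ch, kernel_size=1)      # lateral connection
        self.reduce_top = nn.Conv2d(c_coarse, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, fine, coarse):
        top = F.interpolate(self.reduce_top(coarse), size=fine.shape[-2:],
                            mode='nearest')           # upsample the coarser level
        return self.smooth(self.lateral(fine) + top)  # merge by addition, then smooth
```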

+Instance Segmentation

MULTI-MODAL: Image + Text

  • Leverage text to understand images

Image Captioning

- Use pre-trained CNNs to get image embeddings
- Use a projection matrix to get word embeddings
- Learn an RNN and projection matrix to predict sentences, given the image and word embeddings as input (a minimal sketch follows this list)
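
A minimal sketch of this CNN + projection + RNN captioning recipe (vocabulary size, dimensions and the choice of ResNet-18/LSTM are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        for p in self.cnn.parameters():
            p.requires_grad = False                           # frozen image encoder
        self.img_proj = nn.Linear(512, embed_dim)             # image -> embedding space
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.cnn(images).flatten(1)            # (B, 512)
        img_emb = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, E)
        word_embs = self.word_emb(captions)               # (B, T, E)
        seq = torch.cat([img_emb, word_embs], dim=1)      # image acts as the first token
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                           # next-word logits per step
```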

DEEP LEARNING FOR VIDEOS

  • Facebook is a video-first company: video is treated as first-class media

  • What is in the video? Recommendations, Objectionable content filtering

  • Where is it in the video? Interesting portions, summarization, thumbnail

  • Data evolution, classification, multi-modal

Dataset evolution

Aspects:

- Motion importance: long-term vs short-term motion modelling
- Supervised vs Semi-supervised: annotations at frame-level or video-level
- Video length: clips vs long videos
- Source: wild or controlled
  • Earlier datasets (2005): controlled environments

  • UCF-101 (2012)

    • Action recognition dataset: 101 actions, unconstrained environment
    • Another example: HMDB
  • THUMOS (2014)

    • Untrimmed videos: we don’t know where the action was
  • Activity Net (2015)

    • 200 action classes
    • Untrimmed video classification, temporal action proposal, temporal action localization
    • Activity Net Challenge
  • Kinetics (2017)

    • Large-scale dataset of YouTube videos for action recognition
  • Sports 1M (2014)

    • Very large-scale dataset
    • Sports recognition, >1M videos, >5M clips, 438 classes of sports
    • Very challenging because of the sheer scale of the data
  • YouTube-8M (2016)

    • 8M videos, >500,000 hours
    • 4,800 diverse entities (not necessarily actions or sports)

VIDEO FEATURE EXTRACTION

  • Good aspects: scalable, …

  • Community was split: hand-crafted vs deep (now everyone is going deep)

  • Current best hand-crafted features:

    • iDT (Improved Dense Trajectories)
      • Dense sampling across spatial scales -> motion modelling using optical flow -> trajectory description at the voxel (space-time) level -> feature extraction using HOG, HOF, MBH
      • Pros: No learning, don’t need large-scale training data
      • Cons: Highly hand-crafted, computationally intensive
  • Two-Stream Convolutional Network [Simonyan and Zisserman, 2014]

    • Compute per-frame spatial features using a Spatial ConvNet, compute motion features using a Temporal ConvNet, fuse them together and predict
    • Comparable results to iDT
  • But how to do the fusion of spatial and temporal features?

  • When to do temporal fusion? Karpathy et al., 2014

    • Single Frame vs Late Fusion vs Early Fusion vs Slow Fusion
    • They found Slow fusion performs better, so there’s a sweet spot where we need to go deep in a temporal sense as well
    • I [Dhruv Mahajan] would say there exist datasets where Slow Fusion seems to work best, not that Slow Fusion works best in every case
  • Video LSTM [Srivastava et al., 2014]

    • How about replacing late fusion with LSTMs?
    • Unsupervised
    • Video through LSTM -> 1) Image Reconstruction LSTM, 2) Future Prediction LSTM
  • 2D ConvNet vs 3D ConvNet

    • Most prior work uses 2D ConvNets
    • But they don’t model temporal information well
  • C3D: Generic Deep Features for Videos [Tran et al., 2014]

    • Train 3D ConvNets on large-scale supervised video datasets for feature learning
    • Use the output of the C3D network as features, for classification, detection, etc.
    • Good architectures for 3D ConvNets: on UCF-101, a temporal kernel depth of 3 performed best when compared with depths of 1, 5, 7, and with increasing or decreasing the kernel depth across layers
    • 8 conv layers, 5 pooling layers, 2 fc layers; 3x3x3 conv kernels, 2x2x2 pooling kernels (a minimal 3D-conv block sketch follows this list)
    • Visualizing low-level filters
    • Visualizing computed features
  • Res3D: Better Architecture for Better Spatiotemporal Features [Tran et al.]

    • Conduct careful architecture search along several dimensions: depth 18 is good enough
    • Frame sampling rate: sampling 1 frame every 2-4 frames is fine => 8 fps is more than enough
    • Good input resolution: 128x128 is enough
    • Types of convolutions: 2D vs 3D vs 2.5D (???) vs Mixed (3D early on, then 2D)
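
A minimal 3D-convolution building block in the C3D spirit referenced above (3x3x3 convolutions, 2x2x2 pooling); the full 8-conv/5-pool stacking and the paper's exact pooling schedule are omitted:

```python
import torch.nn as nn

def conv3d_block(in_ch, out_ch, pool=True):
    # video tensors are (batch, channels, frames, height, width)
    layers = [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool3d(kernel_size=2, stride=2))  # halves T, H and W
    return nn.Sequential(*layers)

# e.g. stem = conv3d_block(3, 64) applied to a clip of shape (B, 3, 16, 112, 112)
```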

VIDEO VOXEL PREDICTION

MULTI-MODALITY: Video + Text

MULTI-MODALITY: Video + Audio

Avinash Sharma - Recent Advances in 3D (13:30 to 15:30)

  • Applications of depth, 3D: autonomous navigation, VR, scene understanding (3D mapping,...)

  • Depth from Stereo (classical)

  • Depth from Monocular Image Liu et al., 2016

  • Depth from Multiple Images / Videos [Karsch et al., 2012]

  • Depth from X: Shading Prados et al., 2006, Focus Favaro et al., 2002

  • Depth from Active Sensing Lanman et al., 2009: differences in structured lighting, IR lighting, laser scanning

DEPTH FROM STEREO

  • (Adapted from M. Pollefeys) Disparity between the images formed in the two eyes gives rise to the perception of depth

  • (S. Savarese) Stereo pair: camera centres O1, O2; a point P in the real world; its projections p1, p2 on the two image planes

  • Steps: Camera calibration -> Rectify images -> Compute disparity and estimate depth

  • (S. Fidler) Epipolar line; we see that epipolar lines are slanted

  • Rectification: to straighten the epipolar lines; use a homography transformation to make the image planes parallel to the baseline

  • (S. Lazebnik) So, after rectification, a point in the left image maps to a point in the right image that is displaced only along x

  • Disparity x - x’ per pixel x in L can be found by searching for x’ in R that minimizes the difference in intensities

  • Also, disparity x - x' = B*f/z, where B is the baseline, f is the focal length, and z is the depth of the point

  • Thus, depth z can be found from disparity
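
A quick worked example with assumed numbers (B = 0.1 m, f = 700 px, measured disparity d = 35 px):

```latex
d = x - x' = \frac{B f}{z}
\quad\Rightarrow\quad
z = \frac{B f}{d} = \frac{0.1 \times 700}{35} = 2\ \text{m}
```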

  • Challenges: local intensity values differ in the two cameras, specularities, intensity values at multiple places, missing pixel values, repetitive patterns, occlusions, transparency, perspective distortion

  • To overcome local challenges, go for Patch-Level Cost Aggregation (Sum of Absolute Differences (SAD) over the entire patch, Normalized Cross-Correlation (NCC), Mutual Information); a naive SAD sketch follows
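
A naive SAD block-matching sketch along a rectified scanline, followed by depth recovery via z = B*f/d (window size and disparity range are assumptions; this is a teaching toy, not a practical matcher):

```python
import numpy as np

def sad_disparity(left, right, max_disp=64, win=5):
    """left, right: rectified grayscale images (float arrays of equal shape)."""
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch_l = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(patch_l - right[y - half:y + half + 1,
                                            x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = np.argmin(costs)   # disparity with the minimum SAD cost
    return disp

# depth (same units as the baseline): z = baseline * focal_length / np.maximum(disp, 1e-6)
```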

  • Challenges: assumes constant depth within patch (invalid for depth discontinuities, slanted/non-planar surfaces), repetitive textures

  • Dynamic Programming

  • Improving correspondence using Global Optimization

  • MRF

  • Deep learning: use deep networks to estimate disparity

  • MC(Matching Cost)-CNN [Žbontar and LeCun, 2015]

    • Check if two patches correspond to each other using a CNN
    • Aggregate over adaptive set of patches
  • Efficient Deep Learning for Stereo Matching [Luo et al., 2016]

    • Replaced the concatenation and fc layers with a simple inner product, and matched a left patch against a strip of right-image patches (all candidate disparities) at once
    • Directly compute disparity
    • Minimize cross entropy with GT, introduce smoothness
    • Far lower runtime and far fewer parameters than MC-CNN
  • A CRF energy function optimized via CNNs

Recent advances in 3D SHAPE ACQUISITION and CLASSIFICATION