Real-time object detection is a difficult challenge. In general, object detection demands significant compute, and the scale of that demand varies with the specific architecture of the detector. For example, deformable parts models (DPMs) scan a classifier over an image at multiple scales. The R-CNN family of models improves on this by proposing regions of the image that are then sent to the classifier, reducing the number of classification inferences required. These are "Two-Stage Detectors".
One-Stage Detectors, such as the YOLO family, Single Shot MultiBox Detector, and RetinaNet, produce their predictions in one forward pass of the network. They are generally faster than two-stage detectors with similar (or better) accuracy. Their speed makes them a natural choice for real-time object detection.
For our application of diagnosing malaria, we chose to build on the YOLO family of networks, originally created by Joseph Redmon and Ali Farhadi (https://pjreddie.com/). The architecture of the YOGO network is a simplification of the YOLO networks (specifically YOLO versions 1-3), optimized for our application.
This document will discuss the YOGO architecture at increasing levels of detail. First, though, we must understand how the specifics of our problem (diagnosing malaria) differ from typical object detection datasets.
YOGO was developed to detect malaria parasites in blood, in real time, on limited hardware. Our specific application is simpler than typical object detection problems, due to the small number of classes and the uniform size of the objects. Most object detectors are designed to perform well on large and varied datasets such as MS COCO, which has 80 classes, with objects varying in size, shape, and position within the image.
In contrast, our original application deals with only seven object classes and a uniform (white) background. The following is a label-free image from the Remoscope of cultured *Plasmodium falciparum*:
(A lot of features in this image: blood cells, platelets, toner blobs, and a stuck cell! Also, some ring parasites)
And here is an example of an image from the YOLO9000 Paper (not necessarily in the MS COCO dataset):
Notice the varying scales of the people and the large number of classes - there are "athletes", "contestants", and "unicycles" in this picture. A typical, smaller dataset would have just classified the "athletes" and "contestants" as "people".
The relative simplicity of our problem allows us to strip back a lot of the complexity of a typical YOLO architecture.
The YOGO architecture is a fairly vanilla convolutional neural network (CNN). It has convolutional layers with batch norm layers for normalization, and a little bit of clever shaping to get the final result tensor into the shape that we want.
See the model file for reference from here on.
There are many little variations of the model (see `model_defns.py` for more architectures), but they all have a similar structure: a "backbone", which is all of the convolutional layers except for the final layer, and a "head", which is the final layer[^1].
You can imagine[^2] that the job of the backbone is to turn the input image(s) into an abstract representation of the image (i.e. cells and their locations, cell types, etc.), and the job of the head is to turn that representation into a concrete prediction - specifically, a grid of "cells" that represent rectangular areas on the original image. Each of these "cells" predicts whether or not the center of an object is in that cell, along with the (potential) object's bounding box and classification. The grid, and the values per grid cell, is our prediction tensor.
The output for an image is a 4D tensor of shape $(\text{batch size}, \text{prediction dimension}, S_x, S_y)$. The dimensions $S_x$ and $S_y$ are the width and height of the grid of cells, and the prediction dimension holds each grid cell's predicted values; both are explained below.
Before anything else, we must make the "grid of cells" concept concrete. To downsize the image / feature map, we use convolutions with a stride of 2, which effectively halves the height and width of the feature map each time. Do this enough times, and we end up with a tensor of width ~= 127 and height ~= 95 (depending on the model architecture). Keep in mind that this grid is a feature map that is the result of many convolutions: each cell has "information" from many surrounding cells, due to the convolutions and downsamplings.
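As a minimal sketch of this structure (a hypothetical stand-in, not the actual architecture in `model_defns.py`; the layer counts, channel widths, and image size below are made up for illustration):

```python
import torch
from torch import nn

num_classes = 7
pred_dim = 4 + 1 + num_classes  # box (4) + objectness (1) + classes

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # stride-2 convolution + batch norm: halves the feature map's height and width
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(),
    )

backbone = nn.Sequential(  # "backbone": all convolutional layers but the last
    conv_block(1, 16),     # grayscale input assumed
    conv_block(16, 32),
    conv_block(32, 64),
)
head = nn.Conv2d(64, pred_dim, kernel_size=1)  # "head": per-grid-cell predictions

img = torch.randn(1, 1, 772, 1032)  # (batch, channels, height, width); made-up size
out = head(backbone(img))
print(out.shape)  # torch.Size([1, 12, 97, 129]): three halvings of 772 x 1032
```

Note that PyTorch orders spatial dimensions as (height, width), so the grid here comes out as $(S_y, S_x)$; the real model applies more (and different) layers to reach its final grid size.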
Each cell is responsible for predicting whether an object's center lies in that grid cell. For example, the blood image from above would have a final grid representation similar to this[^3]:
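To make that responsibility assignment concrete, here is a hypothetical mapping from an object's normalized center coordinates to the grid cell that must predict it (`Sx` and `Sy` match the sketch above):

```python
Sx, Sy = 129, 97  # grid width and height from the sketch above

def responsible_cell(x_center: float, y_center: float) -> tuple[int, int]:
    # normalized center coordinates in [0, 1) -> grid cell indices
    return int(x_center * Sx), int(y_center * Sy)

print(responsible_cell(0.5, 0.25))  # (64, 24): middle column, upper-quarter row
```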
In practice, several grid cells may make predictions for the same object. It is common to use an algorithm like Non-Maximum Suppression (NMS) to choose the best box.
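For example, `torchvision` ships an NMS implementation; here is a small usage sketch with made-up boxes and scores:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor(
    [[10.0, 10.0, 50.0, 50.0],    # (x1, y1, x2, y2)
     [12.0, 11.0, 52.0, 49.0],    # near-duplicate of the first box
     [80.0, 80.0, 120.0, 120.0]]  # a separate detection
)
scores = torch.tensor([0.9, 0.8, 0.7])  # e.g. objectness scores

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the near-duplicate was suppressed
```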
The prediction dimension has a size of 4 + 1 + (the number of classes). The "4" is for the bounding box prediction, the "1" is the "objectness" score for the predicted bounding box, and the number of classes is self-explanatory.
For the bounding box prediction, we predict the center of the box $(x_c, y_c)$ along with its width and height $(w, h)$.
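Putting the prediction dimension together, here is a sketch of splitting one grid cell's prediction vector into its parts (the sigmoid / softmax activations are assumptions for illustration; see the loss and postprocessing code for the real details):

```python
import torch

num_classes = 7
pred = torch.randn(4 + 1 + num_classes)  # one grid cell's prediction vector

xc, yc, w, h = pred[:4]                # box center and size
objectness = torch.sigmoid(pred[4])    # assumed: sigmoid for "objectness"
class_probs = pred[5:].softmax(dim=0)  # assumed: softmax over the 7 classes

# center/size -> corner format, handy for IoU computations and NMS
x1, y1 = xc - w / 2, yc - h / 2
x2, y2 = xc + w / 2, yc + h / 2
```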
Reading the code is the best way to learn the details:
- The loss function is arguably the most important part of the project, as it is what gives meaning to the network's output.
- The training loop is pretty much just boilerplate, and is a bit ugly, but it is useful to read nonetheless.
- The `ObjectDetectionDataset` class and data loader have the code to format the labels into something that we can use in training. It could also be cleaned up, but it is not too bad, and may be helpful for understanding YOGO.
- YOLO, YOLO9000, and YOLOv3 are the seminal papers, written primarily by Joseph Redmon and Ali Farhadi, and they should be considered required reading to understand YOLO / YOGO. They are very well written!
- This is a great explanation of YOLO. It is good for comparing / contrasting with YOGO, and has some supplemental specifics.
Footnotes
[^1]: Note that these terms are not rigorous, and expect them to have slightly different meanings when reading other deep learning literature.

[^2]: Interpretability of neural networks is still a very young field, and in general, we don't have a good idea of how exactly they work as they do, so use this only as a flawed mental model.

[^3]: Notice how much finer the grid cells are compared to the blood cells. In fact, the grid that we use is actually way finer! The "fineness" of the grid, the relatively large size of blood cells, and the assumption that blood cells are not stacked on top of each other allow us to make the major simplifications that make YOGO different from YOLO. Also note that the size of the grid is another hyperparameter to tune. Here is the grid for the base model at commit 62b31ab: