Machine Learning Papers

Collection of impactful machine learning papers, with short summaries.


Language

     Mixtral of Experts (Mixtral 8x7B)
     RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
     Mistral 7B
     Efficient Streaming Language Models with Attention Sinks (StreamingLLM)
     Llama 2: Open Foundation and Fine-Tuned Chat Models
     Generative Agents: Interactive Simulacra of Human Behavior
     Sparks of Artificial General Intelligence: Early experiments with GPT-4
     LLaMA: Open and Efficient Foundation Language Models
     Training Compute-Optimal Large Language Models (Chinchilla)
     Training language models to follow instructions with human feedback (InstructGPT)
     LoRA: Low-Rank Adaptation of Large Language Models
     Language Models are Few-Shot Learners (GPT-3)
     Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
     RoBERTa: A Robustly Optimized BERT Pretraining Approach
     Language Models are Unsupervised Multitask Learners (GPT-2)
     BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
     Improving Language Understanding by Generative Pre-Training (GPT)
     Deep contextualized word representations (ELMo)

Vision and Image Generation

     Grounded Segment Anything (Grounded SAM)
     Segment Anything (SAM)
     Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
     OneFormer: One Transformer to Rule Universal Image Segmentation
     BLIP: Bootstrapping Language-Image Pre-training for Unified VL Understanding and Generation
     High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
     Grounded Language-Image Pre-training (GLIP)
     Masked Autoencoders Are Scalable Vision Learners (MAE)
     Simple and Efficient Design For Semantic Segmentation with Transformers (SegFormer)
     MLP-Mixer: An all-MLP Architecture for Vision
     Emerging Properties in Self-Supervised Vision Transformers (DINO)
     Hierarchical Vision Transformer using Shifted Windows (Swin Transformer)
     Training data-efficient image transformers & distillation through attention (DeiT)
     An Image is Worth 16x16 Words (ViT)
     Denoising Diffusion Probabilistic Models (DDPM)
     End-to-End Object Detection with Transformers (DETR)
     Mask R-CNN
     SSD: Single Shot MultiBox Detector
     You Only Look Once: Unified, Real-Time Object Detection (YOLOv1)
     Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
     U-Net: Convolutional Networks for Biomedical Image Segmentation
     Fast R-CNN
     Rich feature hierarchies for accurate object detection and semantic segmentation (R-CNN)

General AI

     Mamba: Linear-Time Sequence Modeling with Selective State Spaces
     FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
     FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
     Efficiently Modeling Long Sequences with Structured State Spaces
     Learning Transferable Visual Models From Natural Language Supervision (CLIP)
     Attention is All You Need (Transformer)
     Generative Adversarial Networks (GAN)
     Auto-Encoding Variational Bayes (VAE)




Papers Explained

Schema Summary

Mixtral 8x7B (MoE)

Mixtral 8x7B is a decoder-only transformer model in which the feed-forward blocks are replaced by mixture-of-experts (MoE) layers. An MoE layer combines multiple sub-modules (the experts) in such a way that during inference only two of them are used per token. A router is trained together with the experts to learn which expert each input should be routed to. Surprisingly, the authors did not observe obvious patterns in the assignment of experts based on topic. Instead, the router exhibits some structured syntactic behaviour. Mixtral 8x7B uses about 13B active parameters per token (two experts are selected for each token prediction), but reaches the performance of a 47B-parameter model (the total number of parameters across all experts). Note that, although Mixtral 8x7B has the inference cost of a 13B-parameter model, it still has the memory footprint of a 47B-parameter model.
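A minimal sketch of top-2 expert routing as described above (a toy re-implementation, not the Mixtral code; dimensions and module names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks 2 of n_experts FFNs per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only the selected experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

y = Top2MoE()(torch.randn(4, 512))             # (4, 512)
```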


RLAIF

This paper explores a method for training LLMs using feedback generated by another LLM instead of human feedback. RLHF is known to be effective, but expensive: it aligns an LLM with human preferences by asking many humans to read and rank some of the LLM's responses. The rankings are used to train a reward model, which is then used in an RL fashion to fine-tune the original LLM. In RLAIF (RL from AI Feedback), an existing LLM generates the feedback on the LLM being trained, replacing the human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. The paper also studies directly using the LLM feedback as the reward signal in RL (without distilling it into a reward model); this variant was preferred over the supervised fine-tuned baseline even more often than standard RLAIF.


Mistral 7B

Introducing Mistral 7B, an LLM that outperforms the open-source state of the art (LLaMA 2). This work suggests that language models can compress knowledge more than previously thought. Mistral 7B is a transformer-based architecture that leverages grouped-query attention for faster inference, coupled with sliding-window attention (implemented through a rolling buffer cache) to effectively handle sequences of arbitrary length. The paper also introduces pre-filling of the KV cache with the prompt, since the prompt is fully known before the first prediction; if the prompt is too large, it is chunked into smaller pieces. In addition to Mistral 7B, the authors also release Mistral 7B-Instruct, a version fine-tuned to follow instructions.

StreamingLLM

LLMs lose fluency when the context grows too long. One approach to extend the context length is window attention, where only the most recent keys and values are cached. This approach shows a spike in perplexity as soon as the window moves from its initial position (and the first tokens fall out of the cache). This paper observes the phenomenon of attention sinks: autoregressive LLMs learn to allocate the attention they don't need elsewhere to the very first tokens, whose attention scores are disproportionately large; the semantics of these tokens do not even matter. Exploiting attention sinks, the paper proposes StreamingLLM, a variation of sliding-window attention that always keeps the first few tokens in the KV cache as the attention window shifts. StreamingLLM enables LLMs trained with a finite attention window to generalize to infinite sequence length without any fine-tuning or spikes in perplexity.
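A minimal sketch of the cache policy described above, assuming the cache is addressed by token position (function and parameter names are illustrative):

```python
def streaming_cache_indices(seq_len: int, n_sinks: int = 4, window: int = 1024) -> list[int]:
    """Indices of tokens kept in the KV cache: the first `n_sinks` tokens
    (attention sinks) plus the most recent `window` tokens."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))
    recent = range(seq_len - window, seq_len)
    return list(range(n_sinks)) + list(recent)

print(streaming_cache_indices(10, n_sinks=2, window=4))  # [0, 1, 6, 7, 8, 9]
```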


LLaMA 2

LLaMA 2 is a collection of pre-trained and fine-tuned LLMs ranging from 7B to 70B parameters. Two families of models are released: LLaMA 2, an updated version of LLaMA trained on a new mix of data (40% more tokens) and with some changes (increased context length and grouped-query attention), and LLaMA 2-Chat, a fine-tuned version of LLaMA 2 optimized for dialogue. Fine-tuning is accomplished via RLHF, a reward-modeling strategy that aligns the model's responses with human expectations (helpfulness and safety). RLHF is explored with both PPO and rejection sampling. For fine-tuning, the paper also introduces Ghost Attention (GAtt), an approach to control dialogue flow over multiple turns, allowing an instruction to be respected throughout the whole dialogue. According to human evaluation, LLaMA 2-Chat is on par with ChatGPT. The authors warn that safety testing has been done in English and might not cover all scenarios.

Generative Agents

Introducing generative agents, computational agents that simulate human behaviour. Long-term coherence is ensured by the generative agents' architecture, which stores, synthesizes, and applies relevant memories to generate believable behaviour using an LLM. The architecture has three main components: a memory stream (a long-term memory database that records the agents' experiences in natural language), a reflection mechanism (which synthesizes memories into higher-level thoughts over time), and a planning mechanism (which translates conclusions and the current environment into action plans). A memory retrieval function combines relevance, recency, and importance to surface the records needed to inform the agents. During evaluation (agents being interviewed), these agents produced believable individual and social behaviours.

Sparks of AGI (GPT-4)

Exploratory paper about GPT-4's capabilities. The authors had access to the raw, unrestricted version of GPT-4. The exploration, closer to traditional psychology than to machine learning, is designed to separate true learning from mere memorization. GPT-4 shows ability in image understanding (it generates images that reflect the relationships between elements), coding (it passes mocked tech interviews), 3D game development, solving mathematical problems (unseen math olympiad and Fermi questions), and real-world interactions (personal assistant, handyman). Moreover, GPT-4 is good at theory of mind, understanding human beliefs and emotions from context. However, GPT-4 struggles to plan ahead, can produce propaganda and conspiracy theories, and lacks intrinsic motivation (it behaves passively).


LLaMA

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters, trained on trillions of tokens using solely publicly available datasets. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. This work is inspired by the Chinchilla scaling laws and aims at being cheaper at inference, underlining the power of training on more data. LLaMA mixes tricks from other LLMs: as in GPT-3, it normalizes the input of each transformer layer instead of normalizing the output; as in PaLM, it uses SwiGLU instead of ReLU; as in RoFormer, it removes absolute positional embeddings and adds rotary positional embeddings (RoPE). LLaMA also includes several optimizations to improve training speed.


Chinchilla

Given a fixed FLOPs (compute) budget, how should one scale the model size and the number of training tokens? Three different approaches are proposed in this paper, all reaching the same conclusion: the number of parameters and the number of training tokens should be scaled equally as compute grows. This is in contrast with previous work on scaling laws for LLMs, which led to a trend of increasing model size without properly scaling the number of training tokens. To validate their analysis, the authors trained Chinchilla, a modified version of a huge, non-compute-optimal model (Gopher), making it smaller (70B params) and training it on more tokens (1.4T). Chinchilla outperforms Gopher.
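A rough worked example of the rule above, assuming the standard training-FLOPs approximation C ≈ 6·N·D and roughly 20 training tokens per parameter (the ratio implied by Chinchilla itself: 1.4T tokens / 70B params):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Back-of-the-envelope compute-optimal sizing.
    Assumes C ~ 6 * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param))."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.9e23 FLOPs (about 6 * 70e9 * 1.4e12) recovers Chinchilla's sizing.
n, d = chinchilla_optimal(5.88e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")   # params ~ 70B, tokens ~ 1.4T
```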

InstructGPT

Introducing InstructGPT, a GPT-3 model fine-tuned on a reward signal learned from human data. The goal is to align an LLM with human preferences. Given a prompt, the LLM produces multiple outputs, and human preferences over these outputs are collected and used to train a reward model. The reward model is then used as a signal for fine-tuning the initial LLM: the reward model scores the model's outputs, and these scores drive the RL updates. This strategy introduces human bias into the text data distribution. Standard LLMs learn to precisely model the distribution of the data. In contrast, the RL component here is not doing distribution matching but mode seeking (it identifies and favours the most preferred outputs, rather than capturing the entire distribution of possibilities). The focus on mode seeking sacrifices diversity in exchange for reliable generations. (relevant previous work)


LoRA

Fine-tuning LLMs is expensive due to their size, as all their parameters (and the corresponding gradients) must be loaded onto the GPU. LoRA (Low-Rank Adaptation) offers a parameter-efficient alternative: it freezes the original weights and introduces a small, separate set of trainable weights. These fine-tuning weights act as a perturbation of the original weights and can be merged into them, so there is no additional cost at inference time. LoRA achieves efficiency through low-rank decomposition: for a layer with weight matrix W, the update ΔW is represented as the product of two matrices, B and A, which together contain far fewer parameters than W. A crucial hyperparameter in LoRA is the rank r of the decomposition: if too low, the fine-tuning may not be expressive enough; if too high, computational resources are wasted.
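A minimal sketch of a LoRA-adapted linear layer (a toy re-implementation, not the authors' code; the alpha/r scaling follows the paper's convention):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)           # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init: start exactly from W
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
out = layer(torch.randn(2, 512))                          # only A and B receive gradients
```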


GPT-3

Previous language models, such as BERT, have a pre-training stage and a fine-tuning stage. During fine-tuning, the model is trained via repeated gradient updates using a large corpus of task examples. In contrast, GPT-2 skips task-specific fine-tuning, integrating the task description in the input. GPT-3 builds upon GPT-2, emphasizing scale. GPT-3 is still a decoder-only architecture (with far more parameters than GPT-2), still trained generatively, and is then used in an in-context learning framework (as a few-shot learner). In-context learning involves presenting examples of the desired behaviour directly within the input; unlike fine-tuning, it involves no gradient updates. In this setting, GPT-3 matches or exceeds the state of the art on several NLP benchmarks.
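An illustrative few-shot prompt in the spirit described above (the demonstrations and the commented-out generation call are made up for illustration; no weights are updated, the examples simply live in the context):

```python
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"       # in-context demonstration 1
    "peppermint => menthe poivrée\n"     # in-context demonstration 2
    "cheese =>"                          # the model completes this line
)
# completion = language_model.generate(prompt)   # hypothetical call; expected: "fromage"
```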


T5

This (long) paper introduces T5 (Text-to-Text Transfer Transformer), a novel approach that frames all NLP tasks as text-to-text problems. In this framework, the model learns to map an input text to a target text, regardless of the specific task (translation, summarization, classification, etc.). This simplifies pre-training, since every task can be framed as text generation. The T5 architecture is an encoder-decoder transformer, pre-trained on multiple tasks simultaneously using a unified objective (which helps with generalization).


RoBERTa

RoBERTa (Robustly optimized BERT approach) optimizes the pre-training of BERT. In particular, the paper proposes (1) to train the model longer, with bigger batches, over more data, (2) to remove the next-sentence-prediction objective (no gains), (3) to train on longer sequences, (4) to dynamically change the masking pattern applied to the training data, and (5) to create a large new dataset (CC-NEWS) of comparable size to other, privately used datasets. About (4): BERT randomly masks 15% of the tokens in each sequence, but the masks are generated once during preprocessing, so a sequence encountered twice is masked in the same way. RoBERTa instead generates the mask on the fly.
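A minimal sketch of dynamic masking (illustrative only: real implementations work on token ids and also replace some selected positions with random or unchanged tokens):

```python
import random

def dynamic_mask(tokens: list[str], mask_prob: float = 0.15) -> list[str]:
    """Re-sample the masked positions every time the sequence is seen,
    instead of fixing them once during preprocessing (static masking)."""
    return [tok if random.random() > mask_prob else "[MASK]" for tok in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(dynamic_mask(sentence))   # different positions are masked on each call
print(dynamic_mask(sentence))
```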


GPT-2

Prior to this paper, the standard approach for task-specific language modeling was unsupervised pre-training followed by supervised fine-tuning. GPT-2 introduces a new approach and becomes a zero-shot NLP task model, i.e. it can perform specific tasks without any change in architecture or training on task-specific data. To achieve this generality, GPT-2 is trained to model p(output | input, task) rather than p(output | input). A training sample for translation looks like ("translate to french", english text, french text), and a sample for reading comprehension looks like ("answer the question", document, question, answer). How can this be enough? If the model is trained on enough (diverse) data, it is already implicitly learning many different tasks.


BERT

Introducing BERT (Bidirectional Encoder Representations from Transformers). While GPT is decoder-only, BERT relies on the transformer's encoder. The attention layers in the encoder are not masked, so each token can attend to all the others (that's why BERT is bidirectional). While GPT is pre-trained generatively (predict the next token), BERT is pre-trained on two tasks: predicting randomly masked words in the input (15% of tokens are masked) and determining whether a pair of sentences consists of two consecutive sentences. The pre-training loss is the combination of the two losses. Once pre-trained, BERT can be fine-tuned for downstream tasks.


GPT

Presenting GPT (Generative Pre-Training), a method for pre-training and fine-tuning a language model built on the transformer's decoder. Similar to ELMo, GPT generates contextualized word embeddings. However, instead of using bidirectional LSTMs, GPT relies on a stack of transformer decoder blocks, with positional information handled by positional embeddings. The network undergoes unsupervised pre-training, where it learns to generate the next word in a sequence. For supervised fine-tuning, the paper proposes various task-specific input transformations and head adaptations.


ELMo

Introducing a word embedding method that learns word representations based on their context in a sentence. ELMo (Embeddings from Language Models) generates contextualized embeddings using deep bidirectional LSTMs: the same sentence goes through a stack of left-to-right LSTMs and a stack of right-to-left LSTMs. The left-to-right and right-to-left internal states are then concatenated and combined with softmax-normalized weights across layers to form the final embedding of each word.


Grounded SAM

Introducing Grounded SAM, a combination of Grounding DINO (open-set object detection) and SAM (promptable segmentation). Given an input image and a text prompt, Grounding DINO generates precise boxes for objects within the image by leveraging the textual information as a condition. The boxes obtained through Grounding DINO then serve as the box prompts for SAM, which generates precise mask annotations. Users have the flexibility to input arbitrary categories or captions, which are automatically matched with entities within the images. Building upon this, it is possible to employ either an image-captioning model (BLIP) or an image-tagging model (RAM) and use their outputs (captions or tags) as inputs to Grounded SAM, generating precise boxes and masks for each instance. This enables the automatic labeling of entire images.


SAM

Introducing SAM (Segment Anything Model), a zero-shot segmentation model driven by user prompts (points, boxes, text). Promptable segmentation is a new vision task introduced by this paper. SAM is trained on SA-1B, a new dataset of 11M images containing 1B segmentation masks. SAM takes as input an image and a prompt and processes them with two independent encoders. The resulting embeddings are passed to a lightweight mask decoder that outputs three segmentation masks (a single mask creates issues with ambiguous prompts). In addition, SAM predicts its own confidence score for each mask, i.e. what it thinks the IoU between the mask and the ground truth will be. The mask with the highest (true) IoU with the ground truth is used to compute the loss, which is a combination of focal loss (focus on hard pixels), dice loss (a variation of IoU), and MSE (between the actual IoU and the predicted one).

Grounding DINO

The paper combines DINO (DETR with Improved deNoising anchOr boxes) with GLIP-style grounded pre-training to extend object detection to unseen categories based on human input (open-set detection). While GLIP teaches the model to associate words with visual regions (language-aware object detection), Grounding DINO learns to detect objects without explicit box labels for new categories, leveraging image-text pairs as a weak form of supervision. For each (image, text) pair, image and text features are extracted using separate backbones and fed into a feature-enhancer module for cross-modality feature fusion. A language-guided query selection module then selects the visual features that are most relevant to the input text. These visual features, along with the text features, are fed into a cross-modality decoder, whose output queries are refined cross-modal features used to predict object boxes. The model learns to adjust boxes with a localization loss, and to map them to the correct text with a contrastive loss.


OneFormer

Introducing OneFormer, the first multi-task universal image segmentation framework based on transformers. With a single, fixed architecture, OneFormer is a task-conditioned model for semantic, instance and panoptic segmentation. It is trained only once using a panoptic segmentation dataset, and outperforms models trained individually on each segmentation task. During training, one segmentation task is sampled and passed to the model as text input. Then, the panoptic ground truth is adapted to the sampled task. Inside OneFormer, two sets of queries are created: text queries (text-based representation for segments in the image) and object queries (image features extracted and processed by a transformer). A contrastive loss is used to align the text and object queries, inherently guiding the model to distinguish between tasks.


BLIP

BLIP is a vision-language model for caption generation. The BLIP architecture - Multimodal mixture of Encoder-Decoder (MED) - consists of three components: (1) unimodal encoders, which separately encode image and text, (2) an image-grounded text encoder, which encodes text while incorporating visual information, and (3) an image-grounded text decoder, which generates text autoregressively while incorporating visual information. Three losses are used (one per component): (1) an image-text contrastive loss, (2) an image-text matching loss, and (3) a language modeling loss. MED is pre-trained on noisy web data and then specialized into two new models: a captioning model, fine-tuned on high-quality data to generate precise captions, and a filtering model, fine-tuned to recognize good captions. The CapFilt framework then uses them to clean web data and produce new samples, which are subsequently used to train MED.

Stable Diffusion (LDM)

A diffusion process progressively adds noise to an image until it becomes Gaussian noise. A diffusion model (a U-Net) learns the reverse process, moving from Gaussian noise back to the original image. This is achieved through incremental steps, with the model each time taking a noisy image as input and producing a less-noisy image as output. How does the text come into play? The text is encoded and used to condition the denoiser (U-Net): the conditioning tokens are concatenated to the input, and, inside the U-Net, attention layers can attend to them. Any form of conditioning can be encoded (text, image, etc.). This paper introduces LDMs (Latent Diffusion Models), a way to address the image-size limitation of diffusion models: LDMs first encode the image into a latent space, run the whole diffusion and denoising process inside the latent space, and then decode the output back to pixel space.


GLIP

Introducing GLIP (Grounded Language-Image Pre-training), which unifies object detection and phrase grounding (localizing the object in an image that a natural language query refers to) to learn object-level, language-aware visual representations. Object detection and phrase grounding are similar, as they both seek to localize objects and align them with semantic concepts. Any object detection model can be converted into a grounding model by replacing the object classification logits with word-region alignment scores (the dot product of the region visual features and the token language features). The GLIP loss combines box regression and region-word alignment. GLIP includes a deep fusion between the image and text encoders, making the detection model language-aware. Cross-modality communication is achieved by the cross-modality multi-head attention module (X-MHA), where each head computes the context vectors of one modality by attending to the other modality.


MAE

Inspired by NLP's BERT, this paper introduces MAE (Masked Autoencoders), a scalable and effective self-supervised approach for visual representation learning. In MAE, an image is divided into patches and a random subset of these patches is masked. The visible patches are fed into a transformer encoder; after encoding, the masked patches are reintroduced, and a decoder learns to reconstruct the original image. Beyond pixel-level reconstruction, MAE learns powerful representations that are useful for downstream tasks. Although both are trained with self-supervised approaches, MAE differs from DINO: MAE learns representations by masking out parts of the image and learning to predict them, while DINO learns representations by creating pairs of differently augmented views of the same image and mapping them to the same latent representation.


SegFormer

Introducing SegFormer, a transformer-based architecture for semantic segmentation. After an image passes through multiple transformer encoder layers (where patches get aggregated after each layer), the patch representations are decoded into a pixel-wise segmentation mask. The decoder head is a simple multilayer perceptron (MLP) that combines hierarchical features from the different encoder stages. While the encoder downsamples at each stage (by merging patches), the decoder takes care of the upsampling.


MLP-Mixer

Presenting MLP-Mixer for image classification, an architecture for vision based on multi-layer perceptrons rather than convolutions or attention. MLP-Mixer takes as input an image divided into patches. Patches are pushed through a linear layer (to project them to a hidden dimension) and then processed by mixer layers, where MLPs are repeatedly applied across feature channels (processing information independently for each patch, enhancing local features) and across spatial locations (processing information across patches, capturing relationships between them). MLP-Mixer demonstrates competitive performance on image classification compared to CNN models and ViT, while having linear complexity in the sequence length (ViT's is quadratic).
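A minimal sketch of one mixer layer (dimensions are illustrative; the real model's stem, classification head, and other details are reduced to the essentials):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token-mixing MLP (across patches) followed by channel-mixing MLP (per patch)."""
    def __init__(self, n_patches=196, channels=512, token_dim=256, channel_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_patches, token_dim), nn.GELU(), nn.Linear(token_dim, n_patches))
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channel_dim), nn.GELU(), nn.Linear(channel_dim, channels))

    def forward(self, x):                      # x: (batch, n_patches, channels)
        # mix across spatial locations: apply the MLP along the patch dimension
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # mix across channels: apply the MLP independently to each patch
        return x + self.channel_mlp(self.norm2(x))

y = MixerBlock()(torch.randn(2, 196, 512))     # (2, 196, 512)
```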


DINO

Introducing DINO (self-DIstillation with NO labels), a self-supervised training method for ViTs. DINO does not involve contrastive learning, only self-distillation. It consists of a teacher and a student, where the teacher is an exponential moving average of past student iterations. Given an image, one augmentation goes through the teacher and a different augmentation goes through the student, and DINO optimizes the student to match the representation produced by the teacher. More precisely, two augmentations x1, x2 of the image x go through both the student and the teacher, producing representations s1, s2, t1, t2; DINO teaches the student to make (s1, t2) similar and (s2, t1) similar. This simple strategy leads to features that capture object boundaries (visible in the last CLS attention map) and represent images well (k-NN classification without fine-tuning).
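A minimal sketch of the teacher update and the crossed student/teacher objective (a toy version: the real method adds temperature scaling, centering of the teacher outputs, and multi-crop augmentations):

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def dino_loss(student, teacher, x1, x2):
    """Cross-entropy between teacher and student outputs on crossed views."""
    s1, s2 = student(x1).log_softmax(-1), student(x2).log_softmax(-1)
    with torch.no_grad():
        t1, t2 = teacher(x1).softmax(-1), teacher(x2).softmax(-1)
    return -0.5 * ((t2 * s1).sum(-1).mean() + (t1 * s2).sum(-1).mean())

student = torch.nn.Linear(128, 64)          # stand-in for the ViT backbone + head
teacher = copy.deepcopy(student)
x1, x2 = torch.randn(8, 128), torch.randn(8, 128)    # two augmented views
loss = dino_loss(student, teacher, x1, x2)
loss.backward()                              # gradients flow only into the student
ema_update(teacher, student)
```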

Swin Transformer

The Swin (Shifted windows) Transformer builds upon ViT by introducing a hierarchical way of processing images. Instead of using fixed-size patches throughout the network, it starts with small patches, processes them through a transformer stage, merges the outputs, processes them again, and so on. Within each stage, shifted-window self-attention limits communication between patches (each patch attends only to patches in its local window), reducing computational complexity while still capturing the essential dependencies. By iteratively processing and merging patches, the Swin Transformer handles varying image sizes and fine-grained tasks better than ViT.


DeiT

ViT achieves state-of-the-art results in image classification by using more than 300M images for training (JFT-300M). This paper introduces DeiT (Data-efficient image Transformer), an encoder-only model that achieves competitive results while being trained only on ImageNet. Convolutions already know how to look at images (patch by patch, with learnable filters), so they introduce a useful bias. Transformers are far more general, so they also need to learn how to handle images directly from the data, which requires large datasets. DeiT circumvents this need by introducing a teacher-student strategy that relies on a distillation token, ensuring that the student learns from a CNN teacher. Why a CNN? CNNs possess prior knowledge about images, so they can learn from less data more easily than a transformer.


ViT

Introducing ViT (Vision Transformer), an adaptation of the transformer model to computer vision (image classification, in this paper). While this is not the first paper to use transformers for computer vision (see DETR), it is the first work to completely discard CNNs. With ViT, an image is split into fixed-size patches, each of which is linearly embedded and combined with a (learnable) positional embedding. The resulting sequence is fed into a transformer encoder. For classification, an extra learnable classification token is added to the sequence.
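A minimal sketch of the patch-embedding step described above (a toy version using a plain reshape and linear projection; the numbers are illustrative):

```python
import torch
import torch.nn as nn

def patchify(images, patch=16):
    """(B, C, H, W) -> (B, n_patches, patch*patch*C): non-overlapping patches, flattened."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)    # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)

images = torch.randn(2, 3, 224, 224)
patches = patchify(images)                        # (2, 196, 768)
embed = nn.Linear(768, 512)
cls = torch.zeros(2, 1, 512)                      # learnable [CLS] token in the real model
tokens = torch.cat([cls, embed(patches)], dim=1)  # (2, 197, 512), ready for the encoder
```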


DDPM

DDPM maps a complex distribution of images into a simpler distribution (Gaussian noise) through a T-step forward process, and then maps it back through a backward process. The forward process is defined as a Markov chain with Gaussian transitions, which (for small noise steps) makes the backward process take the same Gaussian form. The forward process is fixed, while we learn an approximation of the backward process. To learn the backward process, we maximize the log-likelihood of the data; as in VAEs, we actually compute a lower bound and optimize that. Once such a bridge between the two distributions exists, we can sample random noise and transform it into an image from the original complex distribution.
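A minimal sketch of the closed-form forward process, q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), with a simple linear beta schedule (the schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def q_sample(x0, t, noise=None):
    """Sample x_t directly from x_0: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = torch.randn_like(x0) if noise is None else noise
    abar = alphas_bar[t].view(-1, 1, 1, 1)        # broadcast over (B, C, H, W)
    return abar.sqrt() * x0 + (1 - abar).sqrt() * noise

x0 = torch.randn(4, 3, 32, 32)                    # stand-in for a batch of images
t = torch.randint(0, T, (4,))                     # a random timestep per image
xt = q_sample(x0, t)                              # the network learns to undo this noise
```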


DETR

Introducing DETR (DEtection TRansformer) for object detection (and panoptic segmentation). DETR uses a CNN to initially process the input image and extract an H×W×C feature tensor. These features are treated as a sequence of C-dimensional tokens, which are passed to a transformer. The transformer generates a fixed-size set of box predictions and class probabilities; a prediction can be "no object", so no non-maximum suppression is needed at the end. To align the model's outputs with the ground truths, a bipartite (Hungarian) matching loss is used: the model optimizes the IoU between predicted and ground-truth boxes, and it is penalized if it predicts fewer "no object" slots than expected.
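A minimal sketch of the bipartite matching step using scipy (the cost here is a toy L1 box distance; DETR's real matching cost also includes class probabilities and generalized IoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred_boxes = np.array([[0.5, 0.5, 0.2, 0.2],      # (n_queries, 4) in cx, cy, w, h
                       [0.1, 0.1, 0.3, 0.3],
                       [0.8, 0.8, 0.1, 0.1]])
gt_boxes = np.array([[0.52, 0.48, 0.2, 0.2],      # (n_objects, 4)
                     [0.82, 0.79, 0.1, 0.1]])

# cost[i, j] = L1 distance between prediction i and ground truth j
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
pred_idx, gt_idx = linear_sum_assignment(cost)    # one-to-one assignment, minimal cost
print([(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)])  # [(0, 0), (2, 1)]; query 1 -> "no object"
```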


Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN that adds a pixel-level segmentation task alongside object detection. It introduces an additional branch in the network that predicts a (binary) segmentation mask for each detected object. Since the backbone network is shared across tasks, Mask R-CNN produces accurate object masks with only a small overhead over Faster R-CNN. It enables precise instance segmentation, distinguishing between different object instances within the same class.


SSD

Introducing the Single Shot MultiBox Detector (SSD), a single-stage approach for object detection. The idea of SSD is to predict category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. To achieve high accuracy, SSD produces predictions from feature maps of different scales.


YOLO

Presenting YOLO, a single-stage approach for real-time object detection. YOLO places a grid over the input image and makes each cell responsible for predicting (1) B bounding boxes, each with a confidence score for the presence of an object within the box, and (2) a conditional class probability ("if there is an object, the object is .."). The bounding boxes are then combined with the class probabilities, and a threshold is applied to keep only the most confident detections.


Faster R-CNN

In R-CNN, regions of interest (RoIs) are generated using selective search and processed individually by a CNN. Fast R-CNN improves upon this by processing the entire image just once through the CNN, but still uses selective search. Faster R-CNN takes a step further by generating RoIs directly from the features produced by the CNN, via a Region Proposal Network (RPN). Once the RoIs are generated, the subsequent steps remain the same as in Fast R-CNN. Even though it reuses the same features, Faster R-CNN remains a two-stage method.


U-Net

A new architecture for semantic segmentation. U-Net consists of a contracting path (a standard convolutional network) followed by an expansive path. The expansive path upsamples the feature map ("up-convolutions") and concatenates it with the corresponding feature map from the contracting path. Combining high-resolution features from the contracting path with the upsampled output makes it possible to generate very detailed segmentations.


Fast R-CNN

Fast R-CNN is a significant improvement over the original R-CNN. R-CNN processes each region of interest (RoI) independently. Fast R-CNN, instead, processes the entire image in a single pass through the CNN. Then, it projects all the RoIs (produced on the original image by an external algorithm) onto the feature space. Each projected RoI is resized through the RoI pooling layer and further processed. A multi-task loss that combines classification and bounding box regression is used (unlike R-CNN, where each component is treated separately).


R-CNN

Introducing R-CNN, a combination of region proposals and CNN features for object detection. R-CNN uses a region proposal algorithm (selective search) to first select some regions of interest. Each candidate region is resized to a fixed size and passed to the CNN, which then outputs a feature vector. The feature vector is fed into a collection of class-specific SVMs. Each SVM classifies whether the region (represented by the feature vector) contains an object of a specific class or not.

Mamba

The speed-up and memory savings of SSMs/S4 over transformers come at the cost of accuracy. The key weakness is that SSMs cannot focus on or ignore particular inputs: since the matrices describing the model, (A, B, C, Δ), are fixed, SSMs treat each token equally. One way to introduce a selection mechanism is to let the parameters that affect interactions along the sequence be input-dependent. This paper introduces the selective SSM: (1) a selective scan algorithm, which allows the model to filter out irrelevant information, and (2) a hardware-aware implementation that handles intermediate results efficiently through parallel scan, kernel fusion, and recomputation. The selective SSM is wrapped into an architecture (H3 with a gating mechanism) to form the Mamba block. In short, the main difference between Mamba (S6) and S4 is simply making several parameters functions of the input: Mamba lets the transition through the hidden space depend on the current input (though not on previous hidden states; A is still fixed).
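A minimal, purely sequential sketch of a single-channel selective-SSM recurrence: B, C, and Δ are computed from the current input while A stays fixed (the real model vectorizes this over channels, uses a more careful discretization, and replaces the Python loop with a parallel scan):

```python
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    """Toy S6-style recurrence: h_t = exp(Δ_t A) h_{t-1} + Δ_t B_t x_t,  y_t = C_t · h_t."""
    def __init__(self, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_state))      # fixed (input-independent), negative
        self.to_B = nn.Linear(1, d_state)                # input-dependent projections
        self.to_C = nn.Linear(1, d_state)
        self.to_dt = nn.Linear(1, 1)

    def forward(self, x):                                # x: (seq_len,) scalar channel
        h = torch.zeros_like(self.A)
        ys = []
        for x_t in x:
            x_t = x_t.view(1)
            dt = torch.nn.functional.softplus(self.to_dt(x_t))   # step size Δ_t > 0
            A_bar = torch.exp(dt * self.A)                        # discretized transition
            h = A_bar * h + dt * self.to_B(x_t) * x_t             # selective state update
            ys.append((self.to_C(x_t) * h).sum())                 # input-dependent readout
        return torch.stack(ys)

y = SelectiveSSM()(torch.randn(32))                       # (32,)
```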

FlashAttention-2

FlashAttention-2 builds upon FlashAttention with three main ideas. (1) It reduces non-matmul FLOPs by tweaking FlashAttention's algorithm; modern GPUs have specialized compute units that make matmuls extremely fast, so a non-matmul FLOP can be roughly 16x more expensive than a matmul one. (2) It parallelizes the attention computation over the sequence-length dimension to improve occupancy. When dealing with long sequences, the batch size is often small; since FlashAttention parallelizes mainly over the batch and head dimensions, parts of the GPU remain idle. FlashAttention-2 breaks the input sequence into chunks and uses additional thread blocks to process them simultaneously, improving overall GPU utilization. (3) It provides a better work distribution between warps (units of 32 threads scheduled and executed together).

FlashAttention

The standard attention mechanism is slow (and memory-hungry). This paper identifies the bottleneck not in computation but in data movement: naive attention requires repeated reading and writing between GPU HBM (the GPU's large main memory) and SRAM (the much smaller, faster on-chip memory where computation happens). This paper introduces FlashAttention, an IO-aware exact attention algorithm that uses tiling to operate on blocks (instead of gigantic matrices) and kernel fusion to combine multiple operations into a single step, reducing the overall time spent reading and writing. These techniques give memory usage linear in sequence length and much faster computation, since they avoid redundant data movement.
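A minimal sketch of the tiling idea in plain PyTorch: the full (n × m) score matrix is never materialized; key/value blocks are streamed while a running max and normalizer keep the softmax exact (this only illustrates the math; the real speed comes from doing this inside a fused GPU kernel):

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Exact softmax(q k^T / sqrt(d)) v, computed one key/value block at a time."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)                              # (n, d)
    row_max = torch.full((q.shape[0], 1), float("-inf"))
    row_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                        # (n, block)
        new_max = torch.maximum(row_max, scores.amax(-1, keepdim=True))
        correction = torch.exp(row_max - new_max)          # rescale old accumulators
        p = torch.exp(scores - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(64, 32) for _ in range(3))
assert torch.allclose(tiled_attention(q, k, v, block=16),
                      torch.softmax(q @ k.T / 32 ** 0.5, -1) @ v, atol=1e-5)
```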

S4

To address long-range dependencies in seq-to-seq tasks, SSMs (state space models) have been proposed. SSMs describe the evolution of a system over time by defining two sets of equations: state equations and observation equations. For SSMs to effectively memorize the input history (gain memory), the matrix describing the hidden state transition needs to belong to the HiPPO class. While this approach works well, it's computationally impractical. This paper proposes S4 (Structured State Space), which allows for a more practical solution by decomposing the HiPPO-form transition matrix of the hidden state into the sum of low-rank and normal matrices, and then diagonalizing the normal matrix. S4 is as efficient as existing methods, while achieving better performance on tasks involving long-range dependencies.


CLIP

CLIP (Contrastive Language-Image Pre-training) is trained to predict similarities between an image and a text. Unlike traditional methods of learning visual representations, CLIP leverages the freely available text data on the internet (400M image-text pairs, versus ImageNet's 1.2M images). During training, CLIP takes a batch of images and their textual descriptions, embeds images and texts independently, and learns to predict the correct pairings (a contrastive approach). CLIP transfers to most tasks and, used zero-shot, is competitive against some fully trained (and fine-tuned) models. CLIP does not address the poor data efficiency of deep learning; instead, it compensates by using a source of supervision that can be scaled to hundreds of millions of training examples (learning about images from text).
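A minimal sketch of the symmetric contrastive objective over a batch (the embeddings are stand-ins for the image/text encoder outputs; the real model learns the temperature rather than fixing it):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy: the i-th image should match the i-th text and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))               # correct pair lies on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```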


Attention is All You Need (Transformer)

A novel architecture for sequence-to-sequence tasks. The Transformer relies entirely on attention mechanisms (dispensing with recurrence and convolutions), which lets each position in a sequence attend to all others simultaneously. This facilitates efficient capture of long-range dependencies and parallel computation. It achieves state-of-the-art performance on tasks like machine translation, setting a new standard for sequence modeling.
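A minimal sketch of the scaled dot-product attention at the core of the architecture, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V (single head, no masking):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (seq_len, d_k). Every position attends to every other position."""
    scores = q @ k.T / math.sqrt(q.shape[-1])   # pairwise similarities, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)     # each row sums to 1: an attention distribution
    return weights @ v                          # weighted average of the values

q, k, v = (torch.randn(10, 64) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)     # (10, 64)
```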


GAN

Introducing generative models trained via an adversarial process. A generative model G captures the data distribution, while a discriminative model D estimates the probability that a given sample comes from the real data rather than being generated by G. The goal of G is to generate samples good enough to fool D.


VAE

Introducing variational autoencoders (VAEs), a class of generative models that extend the traditional autoencoder (AE) by incorporating probabilistic inference. While in AEs the encoder maps the input to a latent vector, in VAEs the encoder maps the input to a latent distribution. A sample is drawn from this latent distribution (via the reparameterization trick) and passed to the decoder, which reconstructs the input. VAEs minimize a reconstruction loss, as AEs do, plus a regularization term (a KL divergence) that encourages the latent distribution to match a prior, typically a Gaussian.
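A minimal sketch of the reparameterization and the loss described above, assuming a Gaussian prior and a toy MSE reconstruction term (the encoder/decoder networks are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(784, 2 * 16)                 # outputs mean and log-variance (latent dim 16)
decoder = nn.Linear(16, 784)

def vae_loss(x):
    mu, logvar = encoder(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

loss = vae_loss(torch.rand(32, 784))
```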