In this directory, you can find several notebooks that illustrate how to use DeepMind's Perceiver IO for both fine-tuning on custom data and inference. Some are based on the official Colab notebooks released by DeepMind; others are additional notebooks that I believe will be helpful for the community.
The following notebooks are available:
- showcasing masked language modeling and image classification with the Perceiver
- fine-tuning the Perceiver for image classification
- fine-tuning the Perceiver for text classification
- predicting optical flow between a pair of images with PerceiverForOpticalFlow
- auto-encoding a video (images, audio, labels) with PerceiverForMultimodalAutoencoding
Note that these are just a few examples of what you can do with the Perceiver. Many more tasks are possible, such as question answering, named-entity recognition on text, object detection on images, and audio classification. Basically, anything you can do with BERT/ViT/Wav2Vec2/DETR/etc., you can do with the Perceiver too.
The Perceiver and its follow-up variant, Perceiver IO, by Google DeepMind are among my favorite works of 2021.
This model is quite elegant: it sidesteps the quadratic complexity of the self-attention mechanism by applying self-attention to a (not too large) set of latent variables rather than to the inputs. The inputs are only used for cross-attention with the latents. As a result, the size of the inputs (which can be text, image, audio, video, ...) has no impact on the memory and compute requirements of the self-attention operations; only the cross-attention cost grows, linearly, with the number of inputs.
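To make the shapes concrete, here is a minimal NumPy sketch of that idea. It is not the actual implementation: it assumes single-head attention and leaves out the learned projections, residual connections, layer norms and MLPs, and the sizes (`num_inputs`, `num_latents`, `dim`) are illustrative values I picked, not the ones from the paper.

```python
import numpy as np

def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
num_inputs, num_latents, dim = 2048, 64, 32  # illustrative sizes (assumption)

inputs = rng.normal(size=(num_inputs, dim))    # e.g. embedded pixels or tokens
latents = rng.normal(size=(num_latents, dim))  # learned parameters in a real model

# Cross-attention: the latents are the queries, the inputs are keys and values.
latents, cross_weights = attention(latents, inputs, inputs)

# Self-attention runs on the latents only, so its cost ignores num_inputs.
latents, self_weights = attention(latents, latents, latents)

print(cross_weights.shape)  # (num_latents, num_inputs): linear in the inputs
print(self_weights.shape)   # (num_latents, num_latents): independent of the inputs
```

Doubling `num_inputs` doubles the width of `cross_weights` but leaves `self_weights` unchanged, which is the whole point of the design.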
In the Perceiver IO paper, the authors extend this to let the Perceiver also handle arbitrary outputs, next to arbitrary inputs. The idea is similar: the outputs are produced by a set of output queries that only do cross-attention with the latents.
The authors show that the model can achieve great results on a variety of tasks and modalities, including masked language modeling, image classification, optical flow, multimodal autoencoding and games.
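The decoding side can be sketched in the same simplified way. The snippet below assumes the latents have already been computed by the encoder (here they are just random numbers) and uses one query vector per desired output element; the helper name `cross_attend` and the query sizes are my own choices for illustration, and the learned projections are again omitted.

```python
import numpy as np

def cross_attend(queries, keys_values):
    """Single-head cross-attention where keys and values are the same array."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

rng = np.random.default_rng(0)
num_latents, dim = 64, 32
latents = rng.normal(size=(num_latents, dim))  # stand-in for the encoder output

# The number of output queries decides the output size, not the latents.
classification_queries = rng.normal(size=(1, dim))   # one vector for one label
per_position_queries = rng.normal(size=(4096, dim))  # e.g. one query per pixel

class_output = cross_attend(classification_queries, latents)
dense_output = cross_attend(per_position_queries, latents)

print(class_output.shape)  # (1, 32)
print(dense_output.shape)  # (4096, 32)
```

The same latents serve both a single-vector output and a dense, per-position output; only the query array changes.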
The difference between the various models lies in their preprocessor, decoder and optional postprocessor. I've implemented in PyTorch all models that DeepMind open-sourced (originally written in JAX/Haiku).
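As a rough illustration of how the same core can be wrapped for different tasks, here is a toy NumPy sketch. Everything in it is hypothetical: `ToyPerceiver`, its fields and the random projection matrices are names I made up for this example, not the classes used in the notebooks, and attention is again reduced to a single head without learned projections.

```python
from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np

def attend(queries, keys_values):
    """Single-head attention where keys and values are the same array."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

@dataclass
class ToyPerceiver:  # hypothetical class, for illustration only
    preprocessor: Callable[[np.ndarray], np.ndarray]  # raw input -> (num_inputs, dim)
    decoder_queries: np.ndarray                       # (num_outputs, dim)
    latents: np.ndarray                               # (num_latents, dim)
    postprocessor: Optional[Callable[[np.ndarray], np.ndarray]] = None

    def __call__(self, raw):
        inputs = self.preprocessor(raw)
        latents = attend(self.latents, inputs)           # encode
        latents = attend(latents, latents)               # process in latent space
        outputs = attend(self.decoder_queries, latents)  # decode
        if self.postprocessor is not None:
            outputs = self.postprocessor(outputs)
        return outputs

rng = np.random.default_rng(0)
dim, num_latents = 16, 8
shared_latents = rng.normal(size=(num_latents, dim))

# Image classification: flatten pixels, decode one vector, map it to 10 logits.
pixel_proj = rng.normal(size=(3, dim))
class_proj = rng.normal(size=(dim, 10))
image_model = ToyPerceiver(
    preprocessor=lambda img: img.reshape(-1, 3) @ pixel_proj,
    decoder_queries=rng.normal(size=(1, dim)),
    latents=shared_latents,
    postprocessor=lambda out: out @ class_proj,
)

# Text: embed byte IDs, decode one vector per position, no postprocessor.
embedding = rng.normal(size=(256, dim))
text_model = ToyPerceiver(
    preprocessor=lambda ids: embedding[ids],
    decoder_queries=rng.normal(size=(12, dim)),
    latents=shared_latents,
)

image_logits = image_model(rng.normal(size=(32, 32, 3)))
text_outputs = text_model(np.frombuffer(b"hello world!", dtype=np.uint8))
print(image_logits.shape)  # (1, 10)
print(text_outputs.shape)  # (12, 16)
```

Both toy models share the encode/process/decode path and differ only in what is plugged in before and after it.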