This is a tech report for the early-access release of the RADIOv2.5 model family. We plan to publish full papers on the techniques behind this release at upcoming conferences, but we wanted to share the latest models with the community as soon as possible.
On 7.22.24 we are releasing ViT-B/16 and ViT-L/16 pretrained models. Under the hood, we've made a bunch of improvements to the training algorithms to produce these models. Fortunately, the API remains exactly the same!
import torch

model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l',  # Can also be 'radio_v2.5-b' for the ViT-B version
                       force_reload=True,  # Make sure you set this to True the first time you request either of these two models
                       )
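As a quick sanity check, you can run the loaded model directly on an image tensor. This is only a minimal sketch continuing from the snippet above: the random tensor stands in for a real RGB image scaled to [0, 1], and the (summary, spatial features) return layout follows the repository's README.

```python
# Minimal usage sketch: run the loaded model on a dummy input.
# A random tensor stands in for a real RGB image scaled to [0, 1].
model.eval()
x = torch.rand(1, 3, 512, 512)  # H and W should be multiples of the 16px patch size

with torch.no_grad():
    summary, spatial_features = model(x)

print(summary.shape)           # (1, C): one vector per image, e.g. for classification
print(spatial_features.shape)  # (1, H/16 * W/16, C): one token per 16x16 patch
```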
First off, our previous releases have been ViT-H models. While ViT-H is a very powerful architecture, we've heard from the community that there is a need for smaller VFMs (Visual Foundation Models). With this release, we're adding ViT-B/16 and ViT-L/16 models that still achieve very strong quality while being much smaller and faster than ViT-H. In fact, we're so confident in our ViT-L/16 model (RADIOv2.5-L) that we think you should use it instead of RADIOv2.
A major issue we identified in the paper is that RADIO was "mode switching" based on the input resolution of the image. In effect, when the resolution was below roughly 704px, it ran in a "CLIP + DINOv2" mode where the features were very relevant to those two teachers, but completely irrelevant to SAM. Above 720px, RADIO switched modes to produce features that were relevant for SAM, but suddenly incapable of modeling CLIP or DINOv2.
This showed up in strange ways: for example, zero-shot classification at high resolution degraded to random guessing (0.1% on ImageNet-1k). It also meant that our hi-res results when integrated into a VLLM (e.g. LLaVA 1.5 / Vicuna 7B) were similarly poor. Starting with RADIOv2.5, we've solved the mode switching problem, and these models are now truly capable of processing any input resolution without surprising changes in behavior. In fact, RADIOv2.5 loves high resolution, with our best classification and VLLM results coming from resolutions of 768px and above.
As in the paper, we plot the MSE between the DINOv2-g-reg features and the RADIO model's features at various resolutions. While RADIOv2 (owing to its ViT-H backbone) achieves lower MSE at lower resolutions, you can see how at 720px there's a huge spike in error from which it never recovers. This is how we quantified the mode switch. We can also visualize this phenomenon:
You can see how the 720px RADIOv2 image (left) abruptly changes representations, whereas DINOv2 and the RADIOv2.5 models (middle, right) remain consistent and instead produce increasingly fine-grained details. We can also see how RADIOv2 works in reverse with the SAM head, where low-resolution inputs don't produce features that are SAM-like at all. At 1024px, RADIOv2 starts to produce reasonable SAM features. In contrast, RADIOv2.5-L produces SAM-like features at any resolution, and arguably does a better job of extrapolating to 2048px resolution.
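To make the quantification above concrete, here is a rough sketch of how an MSE-versus-resolution comparison against the DINOv2-g-reg teacher can be computed. The adaptor name (`dino_v2`), the dict-of-(summary, features) output layout, and the grid alignment trick are assumptions based on the repository's documented adaptor API, not the exact evaluation script used for the plot.

```python
import torch
import torch.nn.functional as F

# Rough sketch: measure the feature MSE between RADIO's DINOv2 adaptor output and the
# DINOv2-g-reg teacher at several resolutions. Adaptor name and output layout are
# assumptions based on the repo's documented API.
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names='dino_v2').eval()
teacher = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg').eval()

image = torch.rand(1, 3, 1024, 1024)  # placeholder for a real RGB image in [0, 1]

for res in (224, 448, 704, 720, 1024):  # multiples of 16, so res // 16 * 14 is a multiple of 14
    x_radio = F.interpolate(image, size=(res, res), mode='bilinear')
    # Scale the teacher input so both models produce the same (res // 16)^2 token grid.
    t_res = res // 16 * 14
    x_teacher = F.interpolate(image, size=(t_res, t_res), mode='bilinear')

    with torch.no_grad():
        _, radio_feats = radio(x_radio)['dino_v2']                                  # (1, N, C)
        teacher_feats = teacher.forward_features(x_teacher)['x_norm_patchtokens']   # (1, N, C)

    print(res, F.mse_loss(radio_feats, teacher_feats).item())
```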
Just as mode switching is directly observable in the spatial features, it also caused issues with the summary features, which can be seen in zero-shot classification accuracy:
Resolution | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L |
---|---|---|---|
224 | 78.892 | 62.344 | 74.852 |
256 | 80.780 | 68.892 | 78.220 |
336 | 82.320 | 72.626 | 80.004 |
432 | 82.800 | 73.628 | 80.460 |
512 | 82.882 | 73.894 | 80.542 |
768 | 1.292 | 74.386 | 80.804 |
1024 | 0.204 | 74.280 | 80.886 |
Resolution | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L |
---|---|---|---|
512 - ViTDet 16 | 82.370 | 70.488 | 78.102 |
1024 - ViTDet 16 | 0.192 | 72.182 | 79.878 |
Not only do the RADIOv2.5 models allow classification at any resolution, they also allow using ViTDet mode with only a small drop in accuracy.
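For reference, ViTDet-style windowed attention (window size 16, as in the table rows above) can be requested when loading the model. This is only a sketch: the `vitdet_window_size` keyword is our assumption about the hub entrypoint's interface, so please check the repository's hubconf for the exact parameter name.

```python
import torch

# Hypothetical sketch of enabling ViTDet-style windowed attention at load time.
# The 'vitdet_window_size' keyword is an assumed argument name; consult the repo's
# hubconf for the exact interface before relying on it.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l',
                       vitdet_window_size=16)  # matches the "ViTDet 16" rows above
model.eval()

x = torch.rand(1, 3, 1024, 1024)  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    summary, spatial_features = model(x)
```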
There is an important implication to fixing mode switching: it's now possible to ask for both the CLIP and SAM features of a given hi-res image simultaneously, and the results will be meaningful for both. Or, you might want the hi-res DINOv2 spatial features as well as the summary token (for classification) for the same image. This wasn't possible with the RADIOv2 model, which couldn't represent CLIP (or DINO) and SAM at the same time, but it is now fixed with the v2.5 models.
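As a concrete illustration, here is a minimal sketch of requesting multiple adaptors for the same hi-res image in a single forward pass. The adaptor names ('clip', 'sam') and the dict-of-(summary, features) output layout follow the repository's adaptor API as we understand it, so treat the exact layout as an assumption.

```python
import torch

# Sketch: one forward pass yielding both CLIP- and SAM-aligned features for a hi-res image.
# The adaptor names and the dict-of-(summary, features) output layout are assumptions
# based on the repository's documented adaptor API.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l',
                       adaptor_names=['clip', 'sam'])
model.eval()

x = torch.rand(1, 3, 1024, 1024)  # stand-in for a hi-res RGB image in [0, 1]
with torch.no_grad():
    out = model(x)

backbone_summary, backbone_feats = out['backbone']  # RADIO's own representation
clip_summary, clip_feats = out['clip']              # e.g. for zero-shot classification
sam_summary, sam_feats = out['sam']                 # e.g. as a SAM-style image encoding
```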
Last but not least, we tested out the models at various resolutions within LLaVA 1.5 + Vicuna 7B:
Model | Resolution | GQA (Val) | GQA (TestDev) | TextVQA* (Tokens) | TextVQA* (No Tokens) | POPE | VQAv2 |
---|---|---|---|---|---|---|---|
RADIOv2.1 | 432 | 71.70 | 63.01 | 56.32 | 42.03 | 86.20 | 79.28 |
RADIOv2.5-B | 432 | 70.49 | 62.09 | 52.13 | 32.43 | 85.87 | 77.24 |
RADIOv2.5-B | 512 | 71.08 | 62.70 | 54.36 | 36.39 | 86.59 | 78.03 |
RADIOv2.5-B | 768 | 71.99 | 63.31 | 56.93 | 43.96 | 87.54 | 79.22 |
RADIOv2.5-L | 432 | 71.57 | 62.89 | 56.71 | 42.34 | 86.13 | 79.44 |
RADIOv2.5-L | 512 | 72.04 | 63.58 | 58.52 | 46.50 | 86.66 | 80.04 |
RADIOv2.5-L | 768 | 72.91 | 64.13 | 61.93 | 53.95 | 87.68 | 81.02 |
*By default, TextVQA adds detected OCR tokens into the context of the LLM. Because we're interested in how well the vision encoder itself is able to represent text, we study TextVQA both with (Tokens) and without (No Tokens) these tokens.
SigLIP is an extraordinary ViT-L model, and we've added it as a teacher in the latest release. If you'd like to use RADIO's adaptor for it, you can get it using the `siglip` adaptor name. For example, in the `examples/zero_shot_imagenet.py` script, you'd pass `--adaptor-name siglip` as an argument to use SigLIP instead of the default DFN CLIP. The specific SigLIP version we're using is `ViT-SO400M-14-SigLIP-384`, found in the OpenCLIP library.
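Below is a minimal Python sketch of requesting the SigLIP adaptor directly; for the script-based evaluation this corresponds to passing `--adaptor-name siglip` as described above. The adaptor name and the dict-of-(summary, features) output layout follow the repository's adaptor API as we understand it, and the text-side matching against the OpenCLIP `ViT-SO400M-14-SigLIP-384` model is left out here.

```python
import torch

# Sketch: obtain SigLIP-aligned summary/spatial features from RADIO.
# The summary vector is what you'd compare against SigLIP text embeddings
# (from OpenCLIP's ViT-SO400M-14-SigLIP-384) for zero-shot classification.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l',
                       adaptor_names='siglip')
model.eval()

x = torch.rand(1, 3, 512, 512)  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    out = model(x)

siglip_summary, siglip_feats = out['siglip']
```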
Resolution | RADIOv2.5-B | RADIOv2.5-L |
---|---|---|
224 | 58.670 | 72.492 |
256 | 65.190 | 75.962 |
336 | 69.110 | 77.830 |
432 | 70.276 | 78.582 |
512 | 70.694 | 78.828 |
768 | 71.102 | 78.930 |
1024 | 70.900 | 78.922 |
As can be seen, the classification results using the SigLIP head are slightly worse than those using DFN CLIP, so we'd suggest defaulting to DFN CLIP unless you specifically need SigLIP compatibility.
While RADIOv2 produces a visually pleasing video at 1024px resolution, you can clearly see how it switches modes between low and high resolution. All models exhibit strong temporal stability.