
dinov2.cpp

DINOv2 pretrained visual models in C/C++ using ggml and OpenCV.

Description

This project provides a C++ implementation of the DINOv2 family of models. These foundation models are pretrained for image-level and pixel-level visual tasks and support a broad range of applications in image analysis. The goal is to provide the functionality of the PyTorch implementation in C++. This lightweight version of DINOv2, built with ggml and OpenCV, is intended to reduce inference time and memory requirements, particularly for use on edge devices. The implementation was heavily inspired by and built on existing code from vit.cpp.


Features

  • Dependency-free and lightweight inference thanks to ggml.
  • Support for DINOv2 models from Hugging Face, with conversion from PyTorch weights to GGUF.
  • 4-bit, 5-bit and 8-bit quantization support.

DINOv2 Overview

The implementation follows the DINOv2 architecture: a Vision Transformer backbone, optionally with register tokens.

Quick example

Feature Extraction

example input

PCA output

Classification Output

  $ ./bin/dinov2 -t 4 -m ../ggml-model.gguf -i ../assets/tench.jpg 
  main: seed = 42
  main: loaded image '../assets/tench.jpg' (408 x 612)
  dino_model_load: loading model from '../ggml-model.gguf' - please wait
  dino_model_load: hidden_size            = 384
  dino_model_load: num_hidden_layers      = 12
  dino_model_load: num_register_tokens    = 4
  dino_model_load: num_attention_heads    = 6
  dino_model_load: patch_size             = 14
  dino_model_load: img_size               = 518
  dino_model_load: ftype                  = 1
  dino_model_load: qntvr                  = 0
  dino_model_load: num_classes            = 1000
  main: preprocessed image (224 x 224)


> tench, Tinca tinca : 0.90
> coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch : 0.05
> goldfish, Carassius auratus : 0.01
> suit, suit of clothes : 0.01
> barracouta, snoek : 0.00

main: graph computation took 349 ms
  

Realtime Demo

demo_video.mp4

Convert PyTorch to GGUF

# clone the repo recursively
git clone --recurse-submodules git@github.com:lavaman131/dinov2.cpp.git

cd dinov2.cpp

uv venv

# for MacOS/Linux
source .venv/bin/activate
# for Windows
.venv\Scripts\activate

uv sync --frozen

# convert the weights to gguf: DINOv2 small with a patch size of 14 and an image size of 518
# DINOv2 weights are always fp16
# without registers
python ./scripts/dinov2-to-gguf.py --model_name facebook/dinov2-small-imagenet1k-1-layer
# with registers
python ./scripts/dinov2-to-gguf.py --model_name facebook/dinov2-with-registers-small-imagenet1k-1-layer

Build

Install OpenCV

Refer to instructions on the OpenCV website to install OpenCV on your machine.

Using the platform table on the OpenCV site, pick your operating system and choose whether to build from source or install a prebuilt version. Building from source is recommended, since the prebuilt packages only support Visual Studio. OpenCV provides precise step-by-step instructions for building from source.

Configure Environment Variables

Once you have built OpenCV, you need to configure your environment to locate it. You have two options:

Option 1: Set Path in CMakeLists.txt

Add the following line to your CMakeLists.txt file:

set(OpenCV_DIR /path/to/your/opencv/build/folder)

Replace /path/to/your/opencv/build/folder with the absolute path to your OpenCV build directory.

Option 2: Set System Environment Variables

Alternatively, configure your system environment variables:

  1. Set the OpenCV_DIR environment variable to the absolute path of your OpenCV build folder
  2. Add the following directories to your system PATH variable:
    • The absolute path to the OpenCV bin folder
    • The absolute path to the OpenCV lib folder

Note: The bin and lib folders are typically located in the same directory.
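
For example, following the steps above on Linux or macOS, the variables could be exported in your shell profile (the paths below are placeholders for your actual OpenCV build directory):

# Placeholder paths -- replace with your actual OpenCV build directory
export OpenCV_DIR=/path/to/your/opencv/build
export PATH="$OpenCV_DIR/bin:$OpenCV_DIR/lib:$PATH"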

Simple Build

Pass the -c flag when running the inference binary to return classification predictions. Omitting the flag (the default) returns the backbone patch tokens instead.

inference.cpp (Classification)

# on MacOS/Linux 
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 4
./bin/inference -m ../ggml-model.gguf -i ../assets/tench.jpg -c
# on Windows
mkdir build ; cd build
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release ..
ninja
./bin/inference.exe -m ../ggml-model.gguf -i ../assets/tench.jpg -c

inference.cpp (Feature Extraction)

# on MacOS/Linux 
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 4
./bin/inference -m ../ggml-model.gguf -i ../assets/tench.jpg
# on Windows
mkdir build ; cd build
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release ..
ninja
./bin/inference.exe -m ../ggml-model.gguf -i ../assets/tench.jpg

realtime.cpp (Live Feature Extraction)

# on MacOS/Linux 
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 4
./bin/realtime -m ../ggml-model.gguf -i ../assets/tench.jpg
# on Windows
mkdir build ; cd build
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release ..
ninja
./bin/realtime.exe -m ../ggml-model.gguf -i ../assets/tench.jpg

The optimal number of threads depends on many factors, and more is not always better. Using a thread count equal to the number of available physical cores usually gives the best speed.
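
To check the physical core count before choosing -t, you can query the operating system (standard utilities; output formats vary by platform):

# Linux: sockets and cores per socket, plus the logical CPU count
lscpu | grep -E "^(Socket|Core)"
nproc
# macOS: physical core count
sysctl -n hw.physicalcpu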

Per device optimizations

Generate per-device instructions that work best for the given machine rather than using generic CPU instructions.

This can be done by specifying -march=native in the compiler flags (a CMake sketch follows the list below), which enables optimizations such as:

  • Multi-threading and vectorization
  • Loop transformations (unrolling)
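
A minimal sketch of how this could look in CMakeLists.txt, assuming a GCC/Clang toolchain (MSVC uses different switches, and the project's build files may already set equivalent flags):

# Enable native-architecture code generation for GCC/Clang builds
if(NOT MSVC)
    add_compile_options(-march=native)
endif()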

For AMD host processors

You can use a specialized compiler released by AMD to make full use of your specific processor's architecture.

Read more here: AMD Optimizing C/C++ and Fortran Compilers (AOCC)

Follow the instructions there to install the AOCC compiler.

Please note that modern processors tend to see the greatest benefits from a specialized compiler, whereas older CPUs may experience little to no performance improvement.
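
As an illustration, assuming AOCC's clang and clang++ are installed and on your PATH, the build could be configured to use them like this:

# Hypothetical configure step using the AOCC compilers instead of the system default
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
make -j 4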

Using OpenMP

Additionally, compile with OpenMP by adding the -fopenmp flag to the compiler options in the CMakeLists file, enabling multithreaded runs. Make sure to also enable multiple threads when running, e.g.:

OMP_NUM_THREADS=4 ./bin/inference -t 4 -m ../ggml-model.gguf -i ../assets/tench.jpg
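
Instead of passing -fopenmp by hand, CMake's FindOpenMP module can add the right flags. A possible sketch (the target name inference is an assumption; adapt it to the project's actual CMakeLists.txt):

# Find OpenMP and link it into the inference target (target name assumed)
find_package(OpenMP)
if(OpenMP_CXX_FOUND)
    target_link_libraries(inference PRIVATE OpenMP::OpenMP_CXX)
endif()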

Run

inference.cpp

usage: ./bin/inference [options]

options:
  -h, --help              show this help message and exit
  -m FNAME, --model       model path (default: ../ggml-model.gguf)
  -i FNAME, --inp         input file (default: ../assets/tench.jpg)
  -o FNAME, --out         output file for backbone PCA features (default: pca_visual.png)
  -k N, --topk            top k classes to print (default: 5)
  -t N, --threads         number of threads to use during computation (default: 4)
  -c, --classify          whether to classify the image or get backbone PCA features (default: 0)
  -fa, --flash_attn       whether to enable flash_attn, less accurate (default: 0)

realtime.cpp

usage: ./bin/realtime [options]

options:
  -h, --help              show this help message and exit
  -m FNAME, --model       model path (default: ../ggml-model.gguf)
  -t N, --threads         number of threads to use during computation (default: 4)
  -fa, --flash_attn       whether to enable flash_attn, less accurate (default: 0)
  -cid, --camera_id       the id of the camera for realtime backbone PCA feature streaming (default: 0)

Benchmark against PyTorch

First experiments on an Intel Core i9-14900HX show inference speedups over native PyTorch (up to 3x for the small model, roughly 1.5-2x for the other sizes).

DINOv2 inference

You can efficiently run DINOv2 inference on the CPU.

The tables below report memory requirements and inference speed on an Intel Core i9-14900HX (24 cores, 32 threads) for both native PyTorch and dinov2.cpp. Thread counts above 10 give only marginal improvements, but 24 threads were used for these runs. Reported inference speeds are averages over 100 runs for both PyTorch and dinov2.cpp.

DINOv2 with Register Tokens

Model   Max Mem (PyTorch)   Max Mem (dinov2.cpp)   Speed (PyTorch)   Speed (dinov2.cpp)
small   ~457 MB             ~109 MB                297 ms            64 ms
base    ~720 MB             ~367 MB                436 ms            200 ms
large   ~1.57 GB            ~1.2 GB                1331 ms           597 ms
giant   ~4.8 GB             ~4.4 GB                4472 ms           1995 ms

Note: The models used are of the form dinov2-with-registers-{size}-imagenet1k-1-layer.

DINOv2 without Register Tokens

Model   Max Mem (PyTorch)   Max Mem (dinov2.cpp)   Speed (PyTorch)   Speed (dinov2.cpp)
small   ~455 MB             ~110 MB                181 ms            62 ms
base    ~720 MB             ~367 MB                462 ms            197 ms
large   ~1.55 GB            ~1.2 GB                1288 ms           600 ms
giant   ~4.8 GB             ~4.4 GB                4384 ms           1969 ms

Note: The models used are of the form dinov2-{size}-imagenet1k-1-layer.

Benchmark on your machine

In order to test the inference speed on your machine, you can run the following scripts:

chmod +x scripts/benchmark.*

# install memory_profiler & threadpoolctl
pip install memory_profiler threadpoolctl

# run the benchmark of PyTorch
python scripts/benchmark.py

# run the benchmark of dinov2.cpp for non-quantized model
./scripts/benchmark.sh

# to run the benchmark for quantized models: 4 threads and the quantize flag
./scripts/benchmark.sh 4 1

Both scripts use 4 threads by default. In Python, the threadpoolctl library is used to limit the number of threads used by PyTorch.
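
For reference, a minimal sketch of how that thread cap might look in Python; the actual scripts/benchmark.py may differ, and model loading and timing are omitted here:

# Cap the threads used by PyTorch and the underlying BLAS/OpenMP pools
from threadpoolctl import threadpool_limits
import torch

torch.set_num_threads(4)           # PyTorch intra-op threads
with threadpool_limits(limits=4):  # native thread pools (BLAS, OpenMP)
    pass                           # ... run the timed forward passes here ...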

Quantization

dinov2.cpp supports ggml quantization types such as q4_0, q4_1, q5_0, q5_1 and q8_0. You can quantize a model in F32 (the patch embedding is kept in F16) to one of these types using the ./bin/quantize binary.

usage: ./bin/quantize /path/to/ggml-model.gguf /path/to/ggml-model-quantized.gguf type                              
  type = 2 - q4_0                                                                                                       
  type = 3 - q4_1                                                                                                       
  type = 6 - q5_0                                                                                                       
  type = 7 - q5_1                                                                                                       
  type = 8 - q8_0                                                                                                       

For example, you can run the following to convert the model to q5_1:

./bin/quantize ../ggml-model.gguf ../ggml-model-quant.gguf 7

Then you can use ggml-model-quant.gguf just like the model in F16.
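
For example, classification with the quantized model uses the same flags as before:

./bin/inference -m ../ggml-model-quant.gguf -i ../assets/tench.jpg -c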

Results

Here are the benchmarks for the different models and quantizations on my machine. For accurate run-time estimates, each benchmark was run 100 times.

DINOv2 with Register Tokens

Model   Quantization   Speed (ms)   Mem (MB)
small   q4_0           52           49
small   q4_1           50           52
small   q5_0           59           54
small   q5_1           57           57
small   q8_0           51           70
base    q4_0           136          129
base    q4_1           133          139
base    q5_0           164          150
base    q5_1           158          160
base    q8_0           124          211
large   q4_0           395          371
large   q4_1           395          407
large   q5_0           493          443
large   q5_1           490          480
large   q8_0           353          661
giant   q4_0           1275         1281
giant   q4_1           1261         1417
giant   q5_0           1615         1552
giant   q5_1           1583         1687
giant   q8_0           1065         2364

DINOv2 without Register Tokens

Model   Quantization   Speed (ms)   Mem (MB)
small   q4_0           46           49
small   q4_1           48           51
small   q5_0           63           54
small   q5_1           58           57
small   q8_0           50           70
base    q4_0           141          129
base    q4_1           135          140
base    q5_0           162          150
base    q5_1           161          160
base    q8_0           125          212
large   q4_0           389          371
large   q4_1           382          407
large   q5_0           497          444
large   q5_1           478          480
large   q8_0           348          661
giant   q4_0           1268         1281
giant   q4_1           1248         1417
giant   q5_0           1625         1553
giant   q5_1           1576         1688
giant   q8_0           1059         2364

This project was built on and highly inspired by vit.cpp.
