EBind: Multi-Modal Embeddings

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computation.

Installation

Option 1
If you want to work within the repository, use uv to install the necessary dependencies.

uv sync

Option 2
You can also install it as an external dependency for another project:

# Option 2.a; install directly from GitHub
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
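
After installing, a quick import check confirms the package is available (the class names below are taken from the Loading the Model section; this snippet is only an illustrative sanity check):

# sanity check that the installation worked
from ebind import EBindModel, EBindProcessor

print(EBindModel.__name__, EBindProcessor.__name__)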

Warning

If your project uses pytorch~=2.8.0, you should install torchcodec~=0.7.0 instead of the torchcodec~=0.8.0 that uv installs automatically; torchcodec~=0.8.* is only compatible with pytorch~=2.9.0.
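
To verify which version pair you actually ended up with, a quick check using only the standard library (illustrative, not part of the EBind API):

from importlib.metadata import version

print("torch:", version("torch"))
print("torchcodec:", version("torchcodec"))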

Note

The 3D point cloud backbone has a few custom CUDA kernels that you might want to compile (see "Compile PointNet2 CUDA ops (optional)" below). To do that, use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.

Loading the Model

import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)

Processing Multi-Modal Inputs

inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
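
The model returns one embedding tensor per modality, all in the shared space. A quick way to inspect what came back (a sketch; the exact embedding dimension depends on the checkpoint):

# outputs maps each modality name to a batch of embeddings in the shared space
for modality, embeddings in outputs.items():
    print(modality, tuple(embeddings.shape))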

Computing Cross-Modal Similarities

keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)

Expected Output:

image x video similarity: 
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x audio similarity: 
[[0.07 0.05]
 [0.02 0.12]]
==========================
image x text similarity: 
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity: 
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x audio similarity: 
[[0.19 0.08]
 [0.03 0.16]]
==========================
video x text similarity: 
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity: 
[[0.24 0.15]
 [0.17 0.26]]
==========================
audio x text similarity: 
[[ 0.12 -0.  ]
 [ 0.07  0.09]]
==========================
audio x points similarity: 
[[0.13 0.06]
 [0.1  0.12]]
==========================
text x points similarity: 
[[0.19 0.14]
 [0.05 0.18]]
==========================

Note: The image/video similarity is significantly higher because they share the same vision encoder.
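
Because all modalities live in the same embedding space, the same dot products can drive simple cross-modal retrieval. A hypothetical sketch using the embeddings computed above, ranking the example images against each text query:

# rank images by similarity to each text query
sims = outputs["text"] @ outputs["image"].T   # shape: (num_texts, num_images)
best = sims.argmax(dim=1)                     # index of the most similar image per text
for t, i in enumerate(best.tolist()):
    print(f"text {t} -> image {i} (score {sims[t, i].item():.2f})")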

Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the PointNet2 custom ops used for embedding point clouds to get faster inference:

cd src/ebind/models/uni3d/pointnet2_ops && \
    uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
    MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace

We have slightly modified the code in src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py to include a fallback torch implementation, so the model can still run on hardware without a GPU.
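
Whether the compiled ops or the torch fallback gets used depends on CUDA being available at runtime; a minimal check (the import path below is inferred from the file path above and may differ in your checkout):

import torch
# import path assumed from src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py (src layout)
from ebind.models.uni3d.pointnet2_ops import pointnet2_utils

print("CUDA available:", torch.cuda.is_available())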

Contributing

We welcome contributions! If you have suggestions for improvements, new features, or bug fixes, feel free to open an issue or pull request. Please follow the standard GitHub workflow and adhere to our code style and guidelines. For major changes, we recommend discussing them in an issue before submitting a PR.

How to contribute

  1. Fork the repository.
  2. Create your feature branch: git checkout -b my-feature
  3. Commit your changes: git commit -m 'Add some feature'
  4. Push to the branch: git push origin my-feature
  5. Open a pull request describing your changes.

Citation

If you use this codebase in your research or work, please cite it as follows (this entry will be replaced with a formal citation when one becomes available):

@misc{encord-bind,
  author       = {The Encord Team},
  title        = {{EBind}: Multi-modal binding and inference},
  year         = {2025},
  howpublished = {\url{https://github.com/encord-team/ebind}},
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. See the LICENSE file for details.
