EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computation.
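Conceptually, each modality-specific encoder maps its input to a vector in the same space, so cross-modal similarity is just a dot product between (unit-norm) embeddings. A minimal conceptual sketch with made-up tensors and an assumed embedding size, not the library API:
import torch
import torch.nn.functional as F

# Stand-ins for embeddings of two different modalities in the shared space
# (1024 is an assumed dimension; real embeddings come from the encoders in the quickstart below).
image_emb = F.normalize(torch.randn(2, 1024), dim=-1)
text_emb = F.normalize(torch.randn(2, 1024), dim=-1)
similarity = image_emb @ text_emb.T  # (2, 2) cosine-similarity matrix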
Option 1
If you want to work within the repository, use uv to install the necessary dependencies:
uv sync
Option 2
You can also install it as an external dependency for another project:
# Option 2.a
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
Warning
If you are running a project with pytorch~=2.8.0, install torchcodec~=0.7.0 rather than the torchcodec~=0.8.0 that uv installs automatically; torchcodec~=0.8.* matches pytorch~=2.9.0.
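For example, pinning both packages together with pip (version specifiers taken from the warning above; adjust to your setup):
python -m pip install "torch~=2.8.0" "torchcodec~=0.7.0"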
Note
The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional). To do that, use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.
import torch
from ebind import EBindModel, EBindProcessor
model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)

inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}
with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)

keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for j, modality2 in enumerate(keys[i + 1:]):
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print('=' * 26)

Expected Output:
image x video similarity:
[[0.48 0.42]
[0.41 0.6 ]]
==========================
image x audio similarity:
[[0.07 0.05]
[0.02 0.12]]
==========================
image x text similarity:
[[0.16 0.07]
[0.08 0.14]]
==========================
image x points similarity:
[[0.2 0.19]
[0.18 0.19]]
==========================
video x audio similarity:
[[0.19 0.08]
[0.03 0.16]]
==========================
video x text similarity:
[[0.26 0.05]
[0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
[0.17 0.26]]
==========================
audio x text similarity:
[[ 0.12 -0. ]
[ 0.07 0.09]]
==========================
audio x points similarity:
[[0.13 0.06]
[0.1 0.12]]
==========================
text x points similarity:
[[0.19 0.14]
[0.05 0.18]]
==========================
Note: The image/video similarity is significantly higher because they share the same vision encoder.
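As a sketch of how these scores can be used, the snippet below reuses the outputs and inputs from the quickstart above to do text-to-image retrieval; the modality keys match the inputs dict, everything else is illustrative:
# For each image embedding, pick the best-matching text prompt.
sim = outputs["image"] @ outputs["text"].T  # (n_images, n_texts) similarity matrix
best = sim.argmax(dim=1)                    # highest-scoring prompt per image
for img_path, text_idx in zip(inputs["image"], best.tolist()):
    print(f"{img_path} -> {inputs['text'][text_idx]}")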
If you have CUDA available, consider building the PointNet2 custom ops used for embedding point clouds to get faster inference:
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace

We have modified the code slightly in src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py to provide a fallback torch implementation so that the model can run on hardware without a GPU.
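For reference, ops like furthest point sampling can be expressed in plain torch so they run on CPU as well; the sketch below is an illustration of that kind of fallback, not the repository's actual implementation:
import torch

def furthest_point_sample(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    # Greedy furthest point sampling: xyz is (B, N, 3), returns (B, n_samples) indices.
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_samples, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_samples):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                 # (B, 1, 3) newest centre
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # distance to nearest selected point
        farthest = dist.argmax(-1)                                   # next point: furthest from the selected set
    return idx

# e.g. sample 128 points from a batch of 2 random clouds:
# furthest_point_sample(torch.rand(2, 1024, 3), 128)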
We welcome contributions! If you have suggestions for improvements, new features, or bug fixes, feel free to open an issue or pull request. Please follow the standard GitHub workflow and adhere to our code style and guidelines. For major changes, we recommend discussing them in an issue before submitting a PR.
- Fork the repository.
- Create your feature branch: git checkout -b my-feature
- Commit your changes: git commit -m 'Add some feature'
- Push to the branch: git push origin my-feature
- Open a pull request describing your changes.
If you use this codebase in your research or work, please cite it as follows (replace with your own citation when available):
@misc{encord-bind,
author = {The Encord Team},
title = {{EBind}: Multi-modal binding and inference},
year = {2025},
howpublished = {\url{https://github.com/encord-team/ebind}},
}

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See the LICENSE file for details.
