Official DINO-X Model Context Protocol (MCP) server that empowers LLMs with real-world visual perception through image object detection, localization, and captioning APIs.

DINO-X MCP

English | 中文

Enables large language models to perform fine-grained object detection and image understanding, powered by DINO-X and Grounding DINO 1.6 API.

💡 Why DINO-X MCP?

Although multimodal models can understand and describe images, they often lack precise localization and high-quality structured outputs for visual content.

With DINO-X MCP, you can:

🧠 Achieve fine-grained image understanding: both full-scene recognition and targeted detection based on natural language.

🎯 Accurately obtain object count, position, and attributes, enabling tasks such as visual question answering.

🧩 Integrate with other MCP Servers to build multi-step visual workflows.

πŸ› οΈ Build natural language-driven visual agents for real-world automation scenarios.

🎬 Use Case

| 🎯 Scenario | 💬 Prompt | 🖼️ Input Image | ✨ Output |
|---|---|---|---|
| Detection & Localization | Detect and visualize the fire areas in the forest | [image 1-1] | [image 1-2] |
| Object Counting | Please analyze this warehouse image, detect all the cardboard boxes, count the total number | [image 2-1] | [image 2-2] |
| Feature Detection | Find all red cars in the image | [image 4-1] | [image 4-2] |
| Attribute Reasoning | Find the tallest person in the image, describe their clothing | [image 5-1] | [image 5-2] |
| Full Scene Detection | Find the fruit with the highest vitamin C content in the image | [image 6-1] | [image 6-3] Answer: Kiwi fruit (93mg/100g) |
| Pose Analysis | Please analyze what yoga pose this is | [image 3-1] | [image 3-3] |
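
Once detections are available, the counting and attribute-reasoning scenarios above come down to simple post-processing. The following is a minimal sketch only: the `Detection` shape, the helper names, and the sample values are all assumptions made up for illustration, and the server's actual output format may differ.

```typescript
// Hypothetical detection shape; the real server output may differ.
interface Detection {
  category: string;
  bbox: [number, number, number, number]; // [xmin, ymin, xmax, ymax] in pixels
}

// Count detections per category (Object Counting scenario).
function countByCategory(detections: Detection[]): Record<string, number> {
  // A prototype-free object avoids clashes with names like "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const d of detections) {
    counts[d.category] = (counts[d.category] || 0) + 1;
  }
  return counts;
}

// Return the detection of a category with the tallest box, or null if none
// (Attribute Reasoning scenario, using box height as a proxy for height).
function tallest(detections: Detection[], category: string): Detection | null {
  let best: Detection | null = null;
  for (const d of detections) {
    if (d.category !== category) continue;
    if (best === null || d.bbox[3] - d.bbox[1] > best.bbox[3] - best.bbox[1]) {
      best = d;
    }
  }
  return best;
}

// Made-up sample data for illustration only.
const sample: Detection[] = [
  { category: "person", bbox: [10, 20, 60, 220] },
  { category: "person", bbox: [80, 10, 140, 260] },
  { category: "box", bbox: [200, 150, 260, 210] },
];
```

Box height is only a rough stand-in for a person's real height, since perspective and distance from the camera both affect it.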

🚀 Quick Start

1. Prerequisites

You can install Node.js using one of the following methods:

Option A: Command 👍

# For MacOS or Linux
# 1. Install nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# OR
wget -qO- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash

# 2. Add these lines to your profile (~/.bash_profile, ~/.zshrc, ~/.profile, or ~/.bashrc)
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  

# 3. Activate nvm in current shell
source ~/.bashrc
# Or
source ~/.zshrc   

# 4. Verify nvm installation
command -v nvm

# 5. Install and use LTS version of Node.js
nvm install --lts
nvm use --lts

# For Windows
winget install OpenJS.NodeJS.LTS
# Or using PowerShell (Administrator)
iwr -useb https://raw.githubusercontent.com/chocolatey/chocolatey/master/chocolateyInstall/InstallChocolatey.ps1 | iex
choco install nodejs-lts -y

Option B: Manual Installation

Download the installer from nodejs.org

You also need an AI assistant or application that supports MCP clients.

2. Configure MCP Server

You can use DINO-X MCP server in two ways:

Option A: Using NPM Package 👍

Add the following configuration in your MCP client:

{
  "mcpServers": {
    "dinox-mcp": {
      "command": "npx",
      "args": ["-y", "@deepdataspace/dinox-mcp"],
      "env": {
        "DINOX_API_KEY": "your-api-key-here",
        "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory"
      }
    }
  }
}

Option B: Using Local Project

First, clone and build the project:

# Clone the project
git clone https://github.com/IDEA-Research/DINO-X-MCP.git
cd DINO-X-MCP

# Install dependencies
pnpm install

# Build the project
pnpm run build

Then configure your MCP client:

{
  "mcpServers": {
    "dinox-mcp": {
      "command": "node",
      "args": ["/path/to/DINO-X-MCP/build/index.js"],
      "env": {
        "DINOX_API_KEY": "your-api-key-here",
        "IMAGE_STORAGE_DIRECTORY": "/path/to/your/image/directory"
      }
    }
  }
}
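
The two configurations differ only in `command` and `args`. As a hedged illustration, the sketch below generates either variant from the same inputs; `buildConfig` and `ServerEntry` are hypothetical names introduced here and are not part of the DINO-X MCP package.

```typescript
interface ServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// Build the "mcpServers" block for either launch mode.
// Pass `localEntry` (the path to build/index.js) to run from a local clone;
// omit it to launch the published NPM package through npx.
function buildConfig(apiKey: string, imageDir: string, localEntry?: string) {
  const env = { DINOX_API_KEY: apiKey, IMAGE_STORAGE_DIRECTORY: imageDir };
  const entry: ServerEntry = localEntry
    ? { command: "node", args: [localEntry], env: env }
    : { command: "npx", args: ["-y", "@deepdataspace/dinox-mcp"], env: env };
  return { mcpServers: { "dinox-mcp": entry } };
}

// Print the NPM-package variant as JSON, ready to paste into an MCP client.
console.log(
  JSON.stringify(buildConfig("your-api-key-here", "/path/to/your/image/directory"), null, 2)
);
```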

3. Get API Key

Get your API key from the DINO-X Platform (a free quota is available for new users).

Replace your-api-key-here in the configuration above with your actual API key.

4. Environment Variables

The DINO-X MCP server supports the following environment variables:

| Variable Name | Description | Required | Default Value | Example |
|---|---|---|---|---|
| DINOX_API_KEY | Your DINO-X API key for authentication | Required | - | your-api-key-here |
| IMAGE_STORAGE_DIRECTORY | Directory where generated visualization images will be saved | Optional | macOS/Linux: /tmp/dinox-mcp; Windows: %TEMP%\dinox-mcp | /Users/admin/Downloads/dinox-images |
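
To make the defaults concrete, here is a minimal sketch of how these variables could be resolved. It follows the documented behavior but is not the server's actual code; `resolveSettings` is a hypothetical helper, and the environment and platform are passed in as parameters so the logic is easy to test (in Node.js you would pass `process.env` and `process.platform`).

```typescript
// Resolve settings from an environment map, following the documented defaults.
function resolveSettings(
  env: Record<string, string | undefined>,
  platform: string
): { apiKey: string; imageDir: string } {
  const apiKey = env["DINOX_API_KEY"];
  if (!apiKey) {
    throw new Error("DINOX_API_KEY is required");
  }
  // Documented defaults: /tmp/dinox-mcp on macOS/Linux, %TEMP%\dinox-mcp on Windows.
  // "win32" is the value Node.js reports for Windows in process.platform.
  // If TEMP is unset on Windows, this sketch falls back to "\dinox-mcp".
  const fallback =
    platform === "win32"
      ? (env["TEMP"] || "") + "\\dinox-mcp"
      : "/tmp/dinox-mcp";
  return { apiKey: apiKey, imageDir: env["IMAGE_STORAGE_DIRECTORY"] || fallback };
}
```

An explicit `IMAGE_STORAGE_DIRECTORY` always takes precedence over the platform default in this sketch.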

5. Available Tools

Restart your MCP client, and you should be able to use the following tools:

| Method Name | Description | Input | Output |
|---|---|---|---|
| detect-all-objects | Detects and localizes all recognizable objects in an image. | Image | Category names + bounding boxes + captions |
| object-detection-by-text | Detects and localizes objects in an image based on a natural language prompt. | Image + Text prompt | Bounding boxes + object captions |
| detect-human-pose-keypoints | Detects 17 human body keypoints per person in an image for pose estimation. | Image | Keypoint coordinates and captions |
| visualize-detections | Visualizes detection results by drawing bounding boxes and labels on the image. | Image + Detection results | Annotated image saved to storage directory |
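
MCP clients normally issue tool calls for you, but it can help to see what such a call looks like on the wire. The sketch below builds a JSON-RPC 2.0 `tools/call` request in the shape defined by the MCP specification. The argument names `imageFileUri` and `textPrompt` and the example URL are assumptions for illustration; check the tool schema reported by your MCP client for the real parameter names.

```typescript
// Build a JSON-RPC 2.0 "tools/call" request as defined by the MCP specification.
function toolCall(id: number, name: string, args: Record<string, unknown>) {
  return {
    jsonrpc: "2.0",
    id: id,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

// Argument names and the URL below are hypothetical placeholders.
const request = toolCall(1, "object-detection-by-text", {
  imageFileUri: "https://example.com/warehouse.jpg",
  textPrompt: "cardboard box",
});

console.log(JSON.stringify(request, null, 2));
```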

πŸ“ Usage

Supported Image Formats

  • Remote URLs starting with https:// 👍
  • Local file paths (starting with file://)
  • Common image formats: jpg, jpeg, png, webp
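
The rules above can be pre-checked on the client side before calling a tool. This is a sketch under assumptions: `isSupportedImageRef` is a hypothetical helper, and it judges the format by file extension, which may not match how the server itself validates images.

```typescript
const ALLOWED_EXTENSIONS = ["jpg", "jpeg", "png", "webp"];

// Return true if the reference uses a supported scheme and image extension.
function isSupportedImageRef(ref: string): boolean {
  const isRemote = ref.indexOf("https://") === 0;
  const isLocal = ref.indexOf("file://") === 0;
  if (!isRemote && !isLocal) return false;
  // Drop any query string or fragment before reading the extension.
  const path = ref.split(/[?#]/)[0];
  const match = /\.([A-Za-z0-9]+)$/.exec(path);
  return match !== null && ALLOWED_EXTENSIONS.indexOf(match[1].toLowerCase()) !== -1;
}
```

Plain `http://` URLs and bare paths without the `file://` prefix are rejected by this check.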

API Docs

Please refer to DINO-X Platform for API usage limits and pricing information.

πŸ› οΈ Development

Watch Mode

During development, you can use watch mode for automatic rebuilding:

pnpm run watch

Debugging

Use MCP Inspector to debug the server:

pnpm run inspector

License

Apache License 2.0
