Data-Juicer Q&A Copilot

Q&A Copilot is the intelligent question-answering component of the Data-Juicer Agents system, a professional Data-Juicer AI assistant built on the AgentScope framework.

You can chat with our Q&A Copilot, Juicer, on the official Data-Juicer documentation site. Feel free to ask Juicer anything related to the Data-Juicer ecosystem.

Core Components

  • Agent: Intelligent Q&A agent based on ReActAgent
  • FAQ RAG System: Fast and accurate FAQ retrieval powered by Qdrant vector database and DashScope text embedding model
  • MCP Integration: Online GitHub search capabilities through GitHub MCP Server
  • Redis Storage: Supports session history and feedback data persistence
  • Web API: Provides RESTful interfaces for frontend integration

Quick Start

Prerequisites

  • 3.10 <= Python <= 3.12
  • Docker (for running Qdrant vector database)
  • Redis server (optional, activated by SESSION_STORE_TYPE=redis)
  • DashScope API Key (for large language model calls and text embedding)

Installation

  1. Install dependencies

    cd ..
    uv pip install ".[qa]"  # quoted so that shells such as zsh do not expand the brackets
    cd qa-copilot
  2. Install Docker (for Qdrant vector database)

    # Ubuntu/Debian
    sudo apt-get install docker.io
    sudo systemctl start docker
    
    # macOS
    brew install --cask docker  # Docker Desktop; the plain "docker" formula installs only the CLI

    Note: The system will automatically check and start the Qdrant Docker container on startup. If FAQ data is not initialized, the system will automatically read from qa-copilot/rag_utils/faq.txt and initialize the RAG data.

  3. Install and start Redis (optional - skip if using the default SESSION_STORE_TYPE=json)

    # Ubuntu/Debian
    sudo apt-get install redis-server
    redis-server --daemonize yes
    
    # macOS
    brew install redis
    brew services start redis

    Note:

    • If you set SESSION_STORE_TYPE=json (default), session history will be stored as JSON files in the SESSION_STORE_DIR directory with automatic TTL-based cleanup.
    • If you set SESSION_STORE_TYPE=redis, you need to have Redis server running. Session state is automatically managed by RedisMemory, and TTL is handled by Redis server configuration.

Configuration

  1. Set required environment variables

    export DASHSCOPE_API_KEY="your_dashscope_api_key"
    export GITHUB_TOKEN="your_github_token"  # Required: for GitHub MCP integration
  2. Set optional environment variables

    Session Storage Configuration:

    # Session store type: "json" (default) or "redis"
    export SESSION_STORE_TYPE="json"  # or "redis"
    
    # For JSON mode (default):
    export SESSION_STORE_DIR="./sessions"  # Session file storage directory (default: "./sessions")
    export SESSION_TTL_SECONDS="21600"  # Session TTL in seconds (default: 21600 = 6 hours)
    export SESSION_CLEANUP_INTERVAL="1800"  # Cleanup interval in seconds (default: 1800 = 30 minutes)
    
    # For Redis mode:
    export REDIS_HOST="localhost"  # Redis server host (default: "localhost")
    export REDIS_PORT="6379"  # Redis server port (default: 6379)
    export REDIS_DB="0"  # Redis database number (default: 0)
    export REDIS_PASSWORD=""  # Redis password (default: None, optional)
    export REDIS_MAX_CONNECTIONS="10"  # Redis max connections (default: 10)
    # Note: Redis TTL is handled by Redis server configuration, not by application

    Model Configuration:

    export MAX_TOKENS="200000"  # Maximum tokens for context window (default: 200000)
    # Note: This value is multiplied by 3 when passed to DashScopeChatFormatter
    # because CharTokenCounter counts characters, and ~3 chars ≈ 1 token for mixed CHN & ENG text

    Qdrant Vector Database:

    export QDRANT_HOST="127.0.0.1"  # Qdrant server host (default: "127.0.0.1")
    export QDRANT_PORT="6333"  # Qdrant server port (default: 6333)

    Service Configuration:

    export DJ_COPILOT_SERVICE_HOST="127.0.0.1"  # Service host address (default: "127.0.0.1")
    export DJ_COPILOT_ENABLE_LOGGING="true"  # Enable session logging (default: "true")
    export DJ_COPILOT_LOG_DIR="./logs"  # Log directory (default: "./logs")

    Advanced Configuration:

    export FASTAPI_CONFIG_PATH=""  # Path to FastAPI config JSON file (optional)
    export SAFE_CHECK_HANDLER_PATH=""  # Path to custom safe check handler module (optional)
  3. Configure FAQ file (optional)

    The system uses qa-copilot/rag_utils/faq.txt as the FAQ data source by default. You can edit this file to customize FAQ content. FAQ file format example:

    'id': 'FAQ_001', 'question': 'What is Data-Juicer?', 'answer': 'Data-Juicer is a...'
    'id': 'FAQ_002', 'question': 'How to install?', 'answer': 'You can install by...'
    
  4. Start the service

    bash setup_server.sh

    On first startup, the system will automatically:

    • Check and start the Qdrant Docker container (port 6333)
    • Initialize FAQ RAG data (if not already initialized)
    • Start the Web API service

Usage

Web API Interfaces

After starting the service, the system provides the following API interfaces:

1. Q&A Conversation

POST /process
Content-Type: application/json

{
  "input": [
    {
      "role": "user", 
      "content": [{"type": "text", "text": "How to use Data-Juicer for data cleaning?"}]
    }
  ],
  "session_id": "your_session_id",
  "user_id": "user_id"
}
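
The request body above can be built and sent from Python with only the standard library. The following is a minimal sketch, not part of the project: the helper names `build_process_payload` and `post_json` are hypothetical, and the base URL `http://localhost:8080` is an assumption taken from the WebUI command in this README. The response is returned as raw bytes because its exact format is not documented here.

```python
import json
import urllib.request

# Assumed base URL; adjust it to wherever the service is actually listening.
BASE_URL = "http://localhost:8080"


def build_process_payload(text: str, session_id: str, user_id: str) -> dict:
    """Build the JSON body for POST /process in the shape shown above."""
    return {
        "input": [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ],
        "session_id": session_id,
        "user_id": user_id,
    }


def post_json(path: str, payload: dict, timeout: float = 60.0) -> bytes:
    """POST a JSON payload to the service and return the raw response body."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the service running, `post_json("/process", build_process_payload("How to use Data-Juicer for data cleaning?", "your_session_id", "user_id"))` sends the example request. The same `post_json` helper can be reused for `/memory` and `/clear`, whose bodies contain only `session_id` and `user_id`.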

2. Get Session History

POST /memory
Content-Type: application/json

{
  "session_id": "your_session_id",
  "user_id": "user_id"
}

3. Clear Session History

POST /clear
Content-Type: application/json

{
  "session_id": "your_session_id",
  "user_id": "user_id"
}

4. Submit User Feedback

POST /feedback
Content-Type: application/json

{
  "data": {
    "message_id": "message_id_here",
    "feedback_type": "like",
    "comment": "optional user comment"
  },
  "session_id": "your_session_id",
  "user_id": "user_id"
}

Parameters:

  • message_id: The ID of the message to provide feedback on (required)
  • feedback_type: Type of feedback, either "like" or "dislike" (required)
  • comment: User comment text (optional)

Response example:

{
  "status": "ok",
  "message": "Feedback recorded successfully"
}

WebUI

You can launch a WebUI by running the following command in your terminal:

npx @agentscope-ai/chat agentscope-runtime-webui --url http://localhost:8080/process

Refer to AgentScope Runtime WebUI for more information.

Configuration Details

Environment Variables Summary

| Variable | Required | Default | Description |
|---|---|---|---|
| DASHSCOPE_API_KEY | ✅ Yes | - | DashScope API key for LLM and embedding |
| GITHUB_TOKEN | ✅ Yes | - | GitHub token for MCP integration |
| SESSION_STORE_TYPE | ❌ No | "json" | Session storage type: "json" or "redis" |
| SESSION_STORE_DIR | ❌ No | "./sessions" | Session file directory (JSON mode only) |
| SESSION_TTL_SECONDS | ❌ No | 21600 | Session TTL in seconds (JSON mode only, 6 hours) |
| SESSION_CLEANUP_INTERVAL | ❌ No | 1800 | Cleanup interval in seconds (JSON mode only, 30 minutes) |
| REDIS_HOST | ❌ No | "localhost" | Redis server host (Redis mode only) |
| REDIS_PORT | ❌ No | 6379 | Redis server port (Redis mode only) |
| REDIS_DB | ❌ No | 0 | Redis database number (Redis mode only) |
| REDIS_PASSWORD | ❌ No | None | Redis password (Redis mode only, optional) |
| REDIS_MAX_CONNECTIONS | ❌ No | 10 | Redis max connections (Redis mode only) |
| QDRANT_HOST | ❌ No | "127.0.0.1" | Qdrant server host |
| QDRANT_PORT | ❌ No | 6333 | Qdrant server port |
| MAX_TOKENS | ❌ No | 200000 | Maximum tokens for context window (multiplied by 3 for CharTokenCounter) |
| DJ_COPILOT_SERVICE_HOST | ❌ No | "127.0.0.1" | Service host address |
| DJ_COPILOT_ENABLE_LOGGING | ❌ No | "true" | Enable session logging |
| DJ_COPILOT_LOG_DIR | ❌ No | "./logs" | Log directory |
| FASTAPI_CONFIG_PATH | ❌ No | "" | Path to FastAPI config JSON file |
| SAFE_CHECK_HANDLER_PATH | ❌ No | "" | Path to custom safe check handler |
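
To illustrate how the session-related defaults listed above fit together, here is a minimal sketch that reads them from the environment. The `SessionStoreSettings` class and `load_session_settings` function are hypothetical names for illustration; the project's own configuration code may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionStoreSettings:
    store_type: str
    store_dir: str
    ttl_seconds: int
    cleanup_interval: int
    redis_host: str
    redis_port: int
    redis_db: int


def load_session_settings(env=None) -> SessionStoreSettings:
    """Read session settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env

    store_type = env.get("SESSION_STORE_TYPE", "json")
    if store_type not in ("json", "redis"):
        raise ValueError('SESSION_STORE_TYPE must be "json" or "redis"')

    return SessionStoreSettings(
        store_type=store_type,
        store_dir=env.get("SESSION_STORE_DIR", "./sessions"),
        ttl_seconds=int(env.get("SESSION_TTL_SECONDS", "21600")),
        cleanup_interval=int(env.get("SESSION_CLEANUP_INTERVAL", "1800")),
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_db=int(env.get("REDIS_DB", "0")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test without touching the real environment.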

Model Configuration

In app_deploy.py, you can configure the language model to use:

model=DashScopeChatModel(
    "qwen3-max-2026-01-23",  # Model name
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    stream=True,  # Enable streaming response
    enable_thinking=True,  # Enable thinking mode
)

The formatter uses the MAX_TOKENS environment variable (default: 200000) to limit the context window size. CharTokenCounter counts characters rather than tokens, and roughly 3 characters correspond to 1 token for mixed Chinese and English text, so the value is multiplied by 3 before it is passed to DashScopeChatFormatter.
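
The conversion from a token limit to a character budget is simple arithmetic. The sketch below reproduces it; `formatter_char_budget` is a hypothetical helper name, and how the result is actually handed to DashScopeChatFormatter is not shown because that wiring lives in the project code.

```python
import os

# Approximation stated above for mixed Chinese and English text.
CHARS_PER_TOKEN = 3


def formatter_char_budget(env=None) -> int:
    """Return the character budget derived from MAX_TOKENS (default 200000)."""
    env = os.environ if env is None else env
    max_tokens = int(env.get("MAX_TOKENS", "200000"))
    return max_tokens * CHARS_PER_TOKEN
```

With the default setting this yields 200000 × 3 = 600000 characters.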

Session Storage Configuration

JSON Mode (Default):

  • Session history is stored as JSON files in SESSION_STORE_DIR directory
  • Automatic TTL-based cleanup runs every SESSION_CLEANUP_INTERVAL seconds
  • Sessions expire after SESSION_TTL_SECONDS seconds of inactivity
  • No external dependencies required

Redis Mode:

  • Session history is stored in Redis
  • Session state is automatically managed by RedisMemory
  • TTL is handled by Redis server configuration (not application-level)
  • Requires Redis server to be running

FAQ RAG Configuration

The FAQ RAG system uses the following configuration:

  • Vector Database: Qdrant (running in Docker container)
  • Embedding Model: DashScope text-embedding-v4
  • Vector Dimension: 1024
  • Data Source: qa-copilot/rag_utils/faq.txt
  • Storage Location: qa-copilot/rag_utils/qdrant_storage
  • Qdrant Host: Configurable via QDRANT_HOST (default: 127.0.0.1)
  • Qdrant Port: Configurable via QDRANT_PORT (default: 6333)

The system automatically checks if RAG data is initialized on startup. If not initialized, it will automatically read the FAQ file and create vector indexes.
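
Before restarting the service with an edited FAQ file, it can help to check that every line parses. The sketch below assumes each non-empty line follows the `'id': ..., 'question': ..., 'answer': ...` format from the FAQ example in this README and parses it by wrapping the line in braces. This is only one possible way to read the format; the project's own loader may work differently, and `parse_faq_line` and `parse_faq_text` are hypothetical names.

```python
import ast

REQUIRED_KEYS = {"id", "question", "answer"}


def parse_faq_line(line: str) -> dict:
    """Parse one FAQ line into a dict with 'id', 'question' and 'answer'."""
    # Wrapping the line in braces turns it into a Python dict literal.
    record = ast.literal_eval("{" + line.strip() + "}")
    if not isinstance(record, dict):
        raise ValueError(f"not a key/value record: {line!r}")
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys {sorted(missing)} in: {line!r}")
    return record


def parse_faq_text(text: str) -> list[dict]:
    """Parse all non-empty lines of an FAQ file's contents."""
    return [parse_faq_line(line) for line in text.splitlines() if line.strip()]
```

Running `parse_faq_text` over the file contents raises an error on the first malformed line, which makes formatting mistakes easier to locate.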

Troubleshooting

Common Issues

  1. Docker/Qdrant Issues

    • Ensure the Docker daemon is running: docker info (docker --version only confirms that the CLI is installed)
    • Check Qdrant container status: docker ps | grep qdrant
    • Manually start Qdrant container: docker start qdrant
    • Check if Qdrant port is occupied: netstat -tlnp | grep 6333
    • To reinitialize RAG data, delete the qa-copilot/rag_utils/qdrant_storage directory and restart the service
  2. Redis connection failure (when using SESSION_STORE_TYPE=redis)

    • Ensure Redis service is running: redis-cli ping
    • Check if Redis port is occupied: netstat -tlnp | grep 6379 (or your configured REDIS_PORT)
    • Verify Redis configuration: Check REDIS_HOST, REDIS_PORT, REDIS_DB, and REDIS_PASSWORD environment variables
    • Note: Redis TTL is managed by Redis server, not by the application
  3. MCP service startup failure

    • Ensure GITHUB_TOKEN is set and correct (required environment variable)
    • Verify GitHub token has necessary permissions for MCP integration
  4. API Key error

    • Verify DASHSCOPE_API_KEY environment variable is correctly set
    • Confirm API Key is valid and has sufficient quota
  5. FAQ retrieval returns no results

    • Confirm FAQ file qa-copilot/rag_utils/faq.txt exists and is properly formatted
    • Check if Qdrant container is running normally
    • Review logs to confirm RAG data was successfully initialized
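
Several of the checks above come down to whether a service is reachable on its port. As a cross-platform alternative to netstat, the following sketch tests a TCP connection from Python; `is_port_open` is a hypothetical helper, and the default hosts and ports mentioned afterwards are the ones documented in this README.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("127.0.0.1", 6333)` checks the default Qdrant port and `is_port_open("localhost", 6379)` checks the default Redis port. A successful connection only shows that something is listening there, not that it is the expected service.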

Acknowledgments

Parts of this project's code are adapted from the following open-source projects:

Special thanks to the AgentScope team for their excellent framework and sample code!

License

This project uses the same license as the main project. For details, please refer to the LICENSE file.

Related Links