Quick Start


  • Docker


cp .env.example .env
docker-compose up -d

Crawl documents for knowledge base

  • To crawl the documents for the knowledge base, prepare a Firecrawl endpoint, either by self-hosting or by using the public endpoint.
    • For self-hosting, please refer to the following links:
    • We confirmed that crawling works with Firecrawl revision fc08ff450da50eb436d9dfd4a09ac741fd8fbb84; other revisions may not work correctly.
    • After deployment, change the following lines in docker-compose.yml to run the worker service in production mode.
      • This setting is important because the worker service in development mode does not restart automatically when the app crashes.
    <<: *common-service
      - redis
      - playwright-service
      - api
#    command: [ "pnpm", "run", "workers" ] # Original setting for development
    command: [ "pnpm", "run", "worker:production" ] # Modified setting for production
    restart: unless-stopped                # Add this line
  • Run the following command to crawl the documents.
    • If your Firecrawl endpoint is not localhost:3002, specify it with the --firecrawl-host option.
    • Run this command outside the Docker container so that it can reach the local Firecrawl endpoint.
python <URL_TO_START_CRAWLING> <OUTPUT_FILE_NAME> --max-page-count <MAX_PAGE_COUNT> --max-depth <MAX_DEPTH>
  • For example:
python "" tokyo.json --max-page-count 1000 --max-depth 5
  • The crawled documents are saved in tokyo.json, and PDFs are saved in the downloaded_pdfs directory.
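
Before starting a long crawl, it can help to confirm that the Firecrawl endpoint accepts connections. The following is a minimal sketch, not part of this project: the function name `endpoint_is_reachable` is hypothetical, and it only checks that a TCP connection to the host and port can be opened (defaulting to localhost:3002 as mentioned above). It does not verify that Firecrawl itself is healthy.

```python
import socket


def endpoint_is_reachable(host: str = "localhost", port: int = 3002,
                          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Hypothetical helper: it only tests that something is listening on the
    port, not that the service behind it is working correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

If the check returns False for your endpoint, the crawl command is unlikely to succeed, so it is worth inspecting the Firecrawl containers first.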

If you want to crawl all prefectures at once, you can use the following command:


Upload knowledge base to Dify

  • To upload the knowledge base to Dify, update the .env file with the required values


  • Run the following command to upload the knowledge base to Dify
    • Note that you should execute this command in the root directory of this project because the JSON file includes relative paths to the PDFs.


  • For example:

python tokyo.json tokyo-knowledges

  • If you want to upload all prefectures at once, you can use the following command:
  • After the upload, execute the following SQL query to remove the suffix .added_on_upload.html from the document names.
  • This step is necessary because Dify does not accept a URL without an extension, such as '', as a document name, so the suffix ".added_on_upload.html" is added to the document names on upload.
UPDATE documents
SET name = LEFT(name, LENGTH(name) - LENGTH('.added_on_upload.html'))
WHERE name LIKE '%.added_on_upload.html';
  • In Dify 0.6.15, a newly created knowledge base has only_me visibility by default and is visible only to the owner of the Dify workspace.
  • If you cannot see the uploaded knowledge base, execute the following SQL query to change its visibility.
UPDATE datasets SET permission = 'all_team_members' WHERE name = '<KNOWLEDGE_BASE_NAME>';
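
To make the effect of the suffix-removal query above concrete, here is a small Python sketch of the same string operation. It is an illustration only and does not touch the Dify database; the function name `strip_upload_suffix` is hypothetical.

```python
SUFFIX = ".added_on_upload.html"


def strip_upload_suffix(name: str) -> str:
    """Mirror the SQL above: drop the trailing suffix if present.

    Names that do not end with the suffix are returned unchanged, which
    corresponds to rows excluded by the WHERE ... LIKE condition.
    """
    if name.endswith(SUFFIX):
        return name[: len(name) - len(SUFFIX)]
    return name
```

For instance, a document named `https://example.com/page.added_on_upload.html` (an invented name) would become `https://example.com/page`, while `report.pdf` would be left as is.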


  • To evaluate the accuracy of your Dify chatbot, update the .env file with the required values
DIFY_USER=<DIFY_USER_NAME> # Just for logging purposes. Can be anything
AZURE_DEPLOYMENT_ID=<YOUR_AZURE_DEPLOYMENT_ID> # The deployment id of your LLM model. This model evaluates the chatbot responses
  • Restart the container
docker-compose restart
  • Run the evaluation script
    • The evaluation data should be a CSV or JSON file. The required fields are query and expected_answer; source_url_list is optional.
docker-compose exec app python <path_to_your_evaluation_data>
  • The evaluation results will be located in evaluation_results_<timestamp>.json
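
Since the evaluation data must contain query and expected_answer, a quick check before running the script can catch incomplete records. The sketch below is an assumption-laden illustration, not the project's own validation: it assumes a JSON file holds a list of objects and a CSV file has a header row with the field names, and the helper names `load_records` and `find_invalid_records` are hypothetical.

```python
import csv
import json
from pathlib import Path

# Required field names as listed in this README.
REQUIRED_FIELDS = ("query", "expected_answer")


def load_records(path):
    """Load records from a .json file (assumed: list of objects) or a CSV file."""
    path = Path(path)
    if path.suffix.lower() == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def find_invalid_records(records):
    """Return the indices of records with a missing or empty required field."""
    invalid = []
    for index, record in enumerate(records):
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            invalid.append(index)
    return invalid
```

An empty list from `find_invalid_records(load_records("my_eval.csv"))` (with a file name of your choice) suggests that every record has both required fields.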

Test data generation

  • To automatically create test data for the evaluation from a document, run the following command:
docker compose exec app python <path_to_your_document> -o <output_file_name>
  • To generate test data in Japanese, add the --use_japanese option.
docker compose exec app python --use_japanese <path_to_your_document> -o <output_file_name>

Known problems and solutions

  • Firecrawl sometimes crashes, which stops the crawl. In this case, restart the Firecrawl containers and run the script again.
  • Weaviate in Dify sometimes refuses the connection. In this case, restart the Weaviate container and run the script again.
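
The restart-and-rerun workaround above can be automated with a small retry loop. The following is a generic sketch under stated assumptions rather than a feature of this project: `run_with_retries`, `task`, and `recover` are hypothetical names, where `task` would wrap the script invocation and `recover` would wrap the container restart.

```python
import time


def run_with_retries(task, recover, max_attempts=3, delay_seconds=0.0):
    """Call task(); on failure call recover(), wait, and try again.

    task and recover are caller-supplied callables (hypothetical here).
    Raises RuntimeError, chained to the last error, if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Give the service a chance to come back before retrying.
                recover()
                time.sleep(delay_seconds)
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

In practice, `task` could call `subprocess.run(..., check=True)` with the crawl or upload command, and `recover` could restart the affected containers; both are left to the caller because the exact commands depend on your setup.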