Tasks related to machine learning are not functioning properly. #10343

@Tql-ws1

The bug

When attempting to accelerate machine learning tasks with CUDA, the 'immich-machine-learning' container reports the following error: 'Worker (pid:5) was sent code 139!'

The 'immich-server' container reports errors such as: "ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed."
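
For context, exit codes above 128 conventionally mean that a process was terminated by signal number (code - 128), so 139 would correspond to signal 11, which is SIGSEGV (segmentation fault) on Linux. Below is a minimal sketch of that decoding. The helper name `describe_exit_code` is hypothetical, and the sketch assumes the code reported by the worker manager follows the shell-style 128 + signal convention.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name when it encodes one.

    Assumption: codes above 128 follow the "128 + signal number" convention.
    """
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # The number does not match any signal known on this platform.
            return f"unknown signal {code - 128}"
    return f"exit status {code}"


print(describe_exit_code(139))
```

If that assumption holds, the worker is crashing at the native level while the model is being loaded, which would also explain why the server sees the connection closed from the other side.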

The OS that Immich Server is running on

Arch

Version of Immich Server

v1.106.4

Version of Immich Mobile App

N/A

Platform with the issue

  • Server
  • Web
  • Mobile

Your docker-compose.yml content

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    extends:
      file: hwaccel.transcoding.yml
      service: nvenc # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    ports:
      - 15002:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-cuda
    extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
      file: hwaccel.ml.yml
      service: cuda # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: docker.io/redis:6.2-alpine@sha256:d6c2911ac51b289db208767581a5d154544f2b2fe4914ea5056443f62dc6e900
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      start_interval: 30s
      start_period: 5m
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library
# The location where your database files are stored
DB_DATA_LOCATION=./postgres

# To set a timezone, uncomment the next line and change Etc/UTC to a TZ identifier from this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List
TZ=***/***

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=******

# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

Reproduction steps

1. $ podman-compose -f docker-compose.yml.gpu up
2. Open the Immich web UI, go to "Administration Settings", and select "Jobs". Then run either the "Smart Search" or "Face Detection" job.
3. Inspect the container logs in the terminal.

Relevant log output

[immich-server]           | [Nest] 12  - 06/15/2024, 5:48:15 PM     LOG [Api:NestApplication] Nest application successfully started
[immich-server]           | [Nest] 12  - 06/15/2024, 5:48:15 PM     LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.4] [PRODUCTION]
[immich-server]           | [Nest] 12  - 06/15/2024, 5:48:28 PM     LOG [Api:EventRepository] Websocket Connect:    _U400IFfpBR3GxgxAAAB
[immich-machine-learning] | [06/15/24 09:52:16] INFO     Setting 'XLM-Roberta-Large-Vit-B-16Plus' execution
[immich-machine-learning] |                              providers to ['CUDAExecutionProvider',
[immich-machine-learning] |                              'CPUExecutionProvider'], in descending order of
[immich-machine-learning] |                              preference
[immich-machine-learning] | [06/15/24 09:52:16] INFO     Loading visual model
[immich-machine-learning] |                              'XLM-Roberta-Large-Vit-B-16Plus' to memory
[immich-machine-learning] | [06/15/24 09:52:22] ERROR    Worker (pid:5) was sent code 139!
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           |     at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server]           |     at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server]           |     at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server]           |     at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server]           |     at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server]           |     at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server]           |     at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Object:
[immich-server]           | {
[immich-server]           |   "id": "f77bc1f0-bdb9-4040-801a-37c7719e1423"
[immich-server]           | }
[immich-server]           |
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           |     at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server]           |     at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server]           |     at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server]           |     at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server]           |     at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server]           |     at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server]           |     at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:22 PM   ERROR [Microservices:JobService] Object:
[immich-server]           | {
[immich-server]           |   "id": "542539b8-9407-490a-b36e-124c7dfadcea"
[immich-server]           | }
[immich-server]           |
[immich-machine-learning] | [06/15/24 09:52:22] INFO     Booting worker with pid: 38
[immich-machine-learning] | [06/15/24 09:52:26] INFO     Started server process [38]
[immich-machine-learning] | [06/15/24 09:52:26] INFO     Waiting for application startup.
[immich-machine-learning] | [06/15/24 09:52:26] INFO     Created in-memory cache with unloading after 300s
[immich-machine-learning] |                              of inactivity.
[immich-machine-learning] | [06/15/24 09:52:26] INFO     Initialized request thread pool with 12 threads.
[immich-machine-learning] | [06/15/24 09:52:26] INFO     Application startup complete.
[immich-machine-learning] | [06/15/24 09:52:27] INFO     Setting 'XLM-Roberta-Large-Vit-B-16Plus' execution
[immich-machine-learning] |                              providers to ['CUDAExecutionProvider',
[immich-machine-learning] |                              'CPUExecutionProvider'], in descending order of
[immich-machine-learning] |                              preference
[immich-machine-learning] | [06/15/24 09:52:27] INFO     Loading visual model
[immich-machine-learning] |                              'XLM-Roberta-Large-Vit-B-16Plus' to memory
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:32 PM   ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:32 PM   ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server]           |     at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server]           |     at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server]           |     at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server]           |     at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server]           |     at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server]           |     at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server]           |     at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server]           | [Nest] 2  - 06/15/2024, 5:52:32 PM   ERROR [Microservices:JobService] Object:
[immich-server]           | {
[immich-server]           |   "id": "731870bb-f8a0-4b89-8a46-8255f4fd0c43"
[immich-server]           | }
[immich-server]           |
[immich-machine-learning] | [06/15/24 09:52:32] INFO     Booting worker with pid: 69
[immich-machine-learning] | [06/15/24 09:52:36] INFO     Started server process [69]
[immich-machine-learning] | [06/15/24 09:52:36] INFO     Waiting for application startup.
[immich-machine-learning] | [06/15/24 09:52:36] INFO     Created in-memory cache with unloading after 300s
[immich-machine-learning] |                              of inactivity.
[immich-machine-learning] | [06/15/24 09:52:36] INFO     Initialized request thread pool with 12 threads.
[immich-machine-learning] | [06/15/24 09:52:36] INFO     Application startup complete.
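
To make the crash and restart cycle in the log above easier to see, here is a rough sketch that pulls worker crash and worker boot events out of the 'immich-machine-learning' output. The function name `summarize_worker_events` and the regular expressions are assumptions based only on the message formats visible in this log, not on any documented log format.

```python
import re

# Patterns derived from the log lines shown in this report (assumed, not official).
CRASH_RE = re.compile(r"Worker \(pid:(\d+)\) was sent code (\d+)!")
BOOT_RE = re.compile(r"Booting worker with pid: (\d+)")


def summarize_worker_events(log_text: str):
    """Return ([(pid, code), ...], [booted_pid, ...]) found in the log text."""
    crashes = [(int(pid), int(code)) for pid, code in CRASH_RE.findall(log_text)]
    boots = [int(pid) for pid in BOOT_RE.findall(log_text)]
    return crashes, boots


# Three lines copied from the log output in this report.
sample = """\
[immich-machine-learning] | [06/15/24 09:52:22] ERROR    Worker (pid:5) was sent code 139!
[immich-machine-learning] | [06/15/24 09:52:22] INFO     Booting worker with pid: 38
[immich-machine-learning] | [06/15/24 09:52:32] INFO     Booting worker with pid: 69
"""

crashes, boots = summarize_worker_events(sample)
print("crashes:", crashes)
print("boots:", boots)
```

On the excerpt above this finds one logged crash (pid 5, code 139) and two subsequent worker boots (pids 38 and 69), each boot following a failed model load.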

Additional information

The prerequisites for using CUDA to accelerate machine learning tasks are satisfied. The problem does not occur when hardware acceleration is disabled (that is, with the default configuration in docker-compose.yml).

❯ nvidia-container-cli info
NVRM version:   550.90.07
CUDA version:   12.4
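
For anyone collecting the same details from their own host, the following sketch turns the `key: value` lines printed above into a dictionary so they can be attached to a report or compared across machines. The helper name `parse_container_cli_info` is hypothetical, and the parsing assumes the simple colon-separated layout seen in this output.

```python
def parse_container_cli_info(text: str) -> dict:
    """Parse 'key: value' lines into a dict, skipping lines without a value."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


# The two lines reported in this issue.
sample = "NVRM version:   550.90.07\nCUDA version:   12.4\n"

info = parse_container_cli_info(sample)
print(info)
```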
