Closed
Description
The bug
When CUDA is used to accelerate machine learning tasks, the 'immich-machine-learning' container reports the following error: 'Worker (pid:5) was sent code 139!'
The 'immich-server' container then logs errors like this: "ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed."
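For readers unfamiliar with the number in that message, here is a minimal sketch of how a status such as 139 is commonly decoded. It assumes the low 7 bits carry the signal number, which holds both for the shell convention (128 + signal) and for a raw wait status with the core-dump flag set. The helper name `describe_worker_status` is hypothetical and not part of Immich.

```python
import signal


def describe_worker_status(code: int) -> str:
    """Map a worker status such as 139 to a signal name.

    Assumption: the low 7 bits hold the signal number. This covers the
    shell convention (128 + signal) as well as a raw wait status whose
    core-dump flag (0x80) is set.
    """
    signum = code & 0x7F
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        return f"unknown signal {signum}"


print(describe_worker_status(139))
```

On Linux this prints `SIGSEGV`, which would suggest the worker process crashed with a segmentation fault while loading the model, rather than exiting on its own.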
The OS that Immich Server is running on
Arch
Version of Immich Server
v1.106.4
Version of Immich Mobile App
N/A
Platform with the issue
- Server
- Web
- Mobile
Your docker-compose.yml content
name: immich
services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    extends:
      file: hwaccel.transcoding.yml
      service: nvenc # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    ports:
      - 15002:3001
    depends_on:
      - redis
      - database
    restart: always
  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-cuda
    extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
      file: hwaccel.ml.yml
      service: cuda # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
  redis:
    container_name: immich_redis
    image: docker.io/redis:6.2-alpine@sha256:d6c2911ac51b289db208767581a5d154544f2b2fe4914ea5056443f62dc6e900
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always
  database:
    container_name: immich_postgres
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      start_interval: 30s
      start_period: 5m
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always
volumes:
  model-cache:
Your .env content
# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables
# The location where your uploaded files are stored
UPLOAD_LOCATION=./library
# The location where your database files are stored
DB_DATA_LOCATION=./postgres
# To set a timezone, uncomment the next line and change Etc/UTC to a TZ identifier from this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List
TZ=***/***
# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release
# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=******
# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich
Reproduction steps
1. $ podman-compose -f docker-compose.yml.gpu up
2. Open the Immich web UI, go to "Administration Settings", and select "Jobs". Then run either the "Smart Search" or the "Face Detection" job.
3. Inspect the container logs in the terminal.
Relevant log output
[immich-server] | [Nest] 12 - 06/15/2024, 5:48:15 PM LOG [Api:NestApplication] Nest application successfully started
[immich-server] | [Nest] 12 - 06/15/2024, 5:48:15 PM LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.4] [PRODUCTION]
[immich-server] | [Nest] 12 - 06/15/2024, 5:48:28 PM LOG [Api:EventRepository] Websocket Connect: _U400IFfpBR3GxgxAAAB
[immich-machine-learning] | [06/15/24 09:52:16] INFO Setting 'XLM-Roberta-Large-Vit-B-16Plus' execution
[immich-machine-learning] | providers to ['CUDAExecutionProvider',
[immich-machine-learning] | 'CPUExecutionProvider'], in descending order of
[immich-machine-learning] | preference
[immich-machine-learning] | [06/15/24 09:52:16] INFO Loading visual model
[immich-machine-learning] | 'XLM-Roberta-Large-Vit-B-16Plus' to memory
[immich-machine-learning] | [06/15/24 09:52:22] ERROR Worker (pid:5) was sent code 139!
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server] | at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server] | at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server] | at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server] | at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server] | at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server] | at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Object:
[immich-server] | {
[immich-server] | "id": "f77bc1f0-bdb9-4040-801a-37c7719e1423"
[immich-server] | }
[immich-server] |
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server] | at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server] | at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server] | at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server] | at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server] | at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server] | at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:22 PM ERROR [Microservices:JobService] Object:
[immich-server] | {
[immich-server] | "id": "542539b8-9407-490a-b36e-124c7dfadcea"
[immich-server] | }
[immich-server] |
[immich-machine-learning] | [06/15/24 09:52:22] INFO Booting worker with pid: 38
[immich-machine-learning] | [06/15/24 09:52:26] INFO Started server process [38]
[immich-machine-learning] | [06/15/24 09:52:26] INFO Waiting for application startup.
[immich-machine-learning] | [06/15/24 09:52:26] INFO Created in-memory cache with unloading after 300s
[immich-machine-learning] | of inactivity.
[immich-machine-learning] | [06/15/24 09:52:26] INFO Initialized request thread pool with 12 threads.
[immich-machine-learning] | [06/15/24 09:52:26] INFO Application startup complete.
[immich-machine-learning] | [06/15/24 09:52:27] INFO Setting 'XLM-Roberta-Large-Vit-B-16Plus' execution
[immich-machine-learning] | providers to ['CUDAExecutionProvider',
[immich-machine-learning] | 'CPUExecutionProvider'], in descending order of
[immich-machine-learning] | preference
[immich-machine-learning] | [06/15/24 09:52:27] INFO Loading visual model
[immich-machine-learning] | 'XLM-Roberta-Large-Vit-B-16Plus' to memory
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:32 PM ERROR [Microservices:JobService] Unable to run job handler (smartSearch/smart-search): Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:32 PM ERROR [Microservices:JobService] Error: Machine learning request to "http://immich-machine-learning:3003" failed with SocketError: other side closed
[immich-server] | at /usr/src/app/dist/repositories/machine-learning.repository.js:19:19
[immich-server] | at async MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:18:21)
[immich-server] | at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:42:26)
[immich-server] | at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:86:27)
[immich-server] | at async /usr/src/app/dist/services/job.service.js:148:36
[immich-server] | at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
[immich-server] | at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[immich-server] | [Nest] 2 - 06/15/2024, 5:52:32 PM ERROR [Microservices:JobService] Object:
[immich-server] | {
[immich-server] | "id": "731870bb-f8a0-4b89-8a46-8255f4fd0c43"
[immich-server] | }
[immich-server] |
[immich-machine-learning] | [06/15/24 09:52:32] INFO Booting worker with pid: 69
[immich-machine-learning] | [06/15/24 09:52:36] INFO Started server process [69]
[immich-machine-learning] | [06/15/24 09:52:36] INFO Waiting for application startup.
[immich-machine-learning] | [06/15/24 09:52:36] INFO Created in-memory cache with unloading after 300s
[immich-machine-learning] | of inactivity.
[immich-machine-learning] | [06/15/24 09:52:36] INFO Initialized request thread pool with 12 threads.
[immich-machine-learning] | [06/15/24 09:52:36] INFO Application startup complete.
Additional information
The prerequisites for using CUDA to accelerate machine learning tasks are satisfied. The problem does not occur when hardware acceleration is disabled (that is, with the default configuration in docker-compose.yml).
❯ nvidia-container-cli info
NVRM version: 550.90.07
CUDA version: 12.4
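When comparing environments, it can help to pull the driver and CUDA versions out of that output programmatically. The following is a minimal sketch that assumes the output consists of `key: value` lines like the two shown above; the helper name `parse_cli_info` is hypothetical, and any other fields the tool may print are simply collected the same way.

```python
def parse_cli_info(text: str) -> dict[str, str]:
    """Collect `key: value` lines into a dict, ignoring anything else."""
    info: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # only keep lines that actually contain a colon
            info[key.strip()] = value.strip()
    return info


# The two lines reported in this issue, used as sample input.
sample = """NVRM version: 550.90.07
CUDA version: 12.4"""

info = parse_cli_info(sample)
cuda_major = int(info["CUDA version"].split(".")[0])
print(info["NVRM version"], cuda_major)
```

The extracted values can then be checked against whatever driver and CUDA versions the `-cuda` image is documented to require.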