How to check why a job is killed? #2092
I currently use RQ to schedule neural network training jobs using mmdetection (a framework based on PyTorch) in a Docker environment. However, the training job sometimes gets killed unexpectedly.
At first, I thought this might be a memory problem, but increasing the Docker container's memory limit did not resolve it. I also noticed that RQ kills the job when PyTorch tries to download a pretrained model.
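For what it's worth, here is a minimal sketch of how to inspect what RQ recorded for a failed job; the local Redis connection and the queue name `default` are assumptions:

```python
# Minimal sketch: inspect failed RQ jobs. The Redis connection
# details and the queue name "default" are assumptions.
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis()
queue = Queue("default", connection=conn)

registry = FailedJobRegistry(queue=queue)
for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=conn)
    print(job_id, job.get_status())
    # exc_info holds whatever traceback RQ recorded for the failure;
    # for a work-horse process killed by a signal this is usually a
    # short message rather than a full Python traceback.
    print(job.exc_info)
```

If the work-horse was killed by the kernel OOM killer, `exc_info` alone may not say so; on Linux, `dmesg` on the host (or `docker inspect`, whose `State` section reports whether the container was OOM-killed) is usually more conclusive.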
However, the mmpretrain package (a classification package similar to mmdetection) works smoothly within an RQ job. The training also runs fine when launched via the `subprocess.run` function within the RQ job. How can I find out what caused the problem?
Any information would be helpful!
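For reference, the `subprocess.run` variant that works looks roughly like this; the `tools/train.py` entry point and the config path follow mmdetection's usual project layout and are assumptions here:

```python
# Rough sketch of the subprocess.run workaround described above.
# "tools/train.py" and the config path are assumptions based on
# mmdetection's usual project layout.
import subprocess

def train_job(config_path: str) -> None:
    # Running the trainer in a child process keeps it out of the RQ
    # work-horse process; this is the variant that did not get killed.
    subprocess.run(
        ["python", "tools/train.py", config_path],
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

The job would then be enqueued as usual, e.g. `queue.enqueue(train_job, "configs/my_config.py")` (the config path is hypothetical).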