How to check why a job is killed? #2092
I currently use RQ to schedule neural network training jobs using mmdetection (a framework based on PyTorch) in a Docker environment. However, the training job sometimes gets killed unexpectedly.
At first, I thought this might be a memory problem, but increasing the Docker container's memory limit did not resolve it. I also noticed that RQ kills the job when PyTorch tries to download a pretrained model.
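For what it's worth, here is a minimal sketch of how to inspect what RQ recorded for a failed job; the local Redis connection and the queue name `default` are assumptions:

```python
# Minimal sketch: inspect failed RQ jobs. The Redis connection
# details and the queue name "default" are assumptions.
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

conn = Redis()
queue = Queue("default", connection=conn)

registry = FailedJobRegistry(queue=queue)
for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=conn)
    print(job_id, job.get_status())
    # exc_info holds whatever traceback RQ recorded for the failure;
    # for a work-horse process killed by a signal this is usually a
    # short message rather than a full Python traceback.
    print(job.exc_info)
```

If the work-horse was killed by the kernel OOM killer, `exc_info` alone may not say so; on Linux, `dmesg` on the host (or `docker inspect`, whose `State` section reports whether the container was OOM-killed) is usually more conclusive.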
However, the mmpretrain package (a classification package similar to mmdetection) works smoothly within an RQ job. The training also runs fine when launched via the `subprocess.run` function within the RQ job. How can I find out what caused the problem?
Any information would be helpful!
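For reference, the `subprocess.run` variant that works looks roughly like this; the `tools/train.py` entry point and the config path follow mmdetection's usual project layout and are assumptions here:

```python
# Rough sketch of the subprocess.run workaround described above.
# "tools/train.py" and the config path are assumptions based on
# mmdetection's usual project layout.
import subprocess

def train_job(config_path: str) -> None:
    # Running the trainer in a child process keeps it out of the RQ
    # work-horse process; this is the variant that did not get killed.
    subprocess.run(
        ["python", "tools/train.py", config_path],
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

The job would then be enqueued as usual, e.g. `queue.enqueue(train_job, "configs/my_config.py")` (the config path is hypothetical).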