This repository has been archived by the owner on Jul 21, 2024. It is now read-only.

[BUG] All rendering stops if a single agent stops responding #134

Open
meermanr opened this issue Dec 30, 2020 · 1 comment
Labels
bug Something isn't working

@meermanr

Issue description:

Last night I left BlendNet rendering out an animation and went to bed, and in the morning found all my equipment idle but lots of render jobs shown as PENDING in the job list.

It seems that I forgot to attach the charger to one of my laptops, so at some point in the night it turned off. Because my laptops are not cloud instances, the agent never received a SIGTERM and so could not inform the Manager before going down: it simply stopped responding to the Manager's status requests.

It appears that the Manager does not give out any more work packages to agents until the "missing agent" responds. I would expect that the Manager would give work out to the remaining Agents and carry on getting work done.
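One way a manager could tell that an agent has gone away is to count consecutive failed status polls and treat the agent as offline once a threshold is reached. The sketch below only illustrates that idea: `AgentHealth`, `max_failures`, and the agent name are hypothetical and not taken from BlendNet's code.

```python
class AgentHealth:
    """Tracks consecutive failed status polls per agent (hypothetical helper)."""

    def __init__(self, max_failures=3):
        # Assumed threshold: failed polls in a row before an agent counts as offline.
        self.max_failures = max_failures
        self._failures = {}

    def record(self, agent, ok):
        """Record the outcome of one status poll for `agent`."""
        if ok:
            # Any successful poll resets the counter, so a returning agent
            # is considered online again immediately.
            self._failures[agent] = 0
        else:
            self._failures[agent] = self._failures.get(agent, 0) + 1

    def is_online(self, agent):
        """True while the agent has not reached the failure threshold."""
        return self._failures.get(agent, 0) < self.max_failures


health = AgentHealth(max_failures=3)
for _ in range(3):
    health.record("laptop", ok=False)
print(health.is_online("laptop"))  # False after three failures in a row
```

A scheduler that consults `is_online` before handing out work would skip the powered-off laptop instead of waiting for it.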

Environment:

  • Application version: v0.3.5-4-g6f54c2f
  • Blender client version: 2.91.0
  • Blender worker version: 2.91.0
  • OS: Mac OS 11.1 (Addon & Agent) / Mac OS 10.15.7 (Manager & Agent) / Ubuntu 18.04 (Agent)

Steps to reproduce:

  1. Run Blender
  2. Add two or more agents to BlendNet, and ensure they are showing checks ✓
  3. Kill one of the Agents (I pressed ^C twice in the terminal)
  4. Open project, or save default scene to a file
  5. Click "Run image task"
  6. See that it enters RUNNING state, but never completes

In my case I started with four agents online, and took one offline before clicking "Run image task". The task seems to get stuck at 75% (1/4), so it appears the Manager has assigned the remaining work to the missing agent, and until that agent takes it all other agents stay idle.

Expected behavior

The task should complete, because other agents are available.

I would expect the Manager to give work to any agent that is online and idle, i.e. if an agent goes offline (gracefully or otherwise) the other agents will do the work, eventually.
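The expected behavior could be sketched as a reassignment step: any task held by an agent that is no longer online is moved to an agent that is online and idle. This is a minimal illustration, not BlendNet's scheduler; `reassign_orphaned`, the task names, and the agent names are all made up for the example.

```python
def reassign_orphaned(assignments, online, idle):
    """Return a copy of `assignments` (task -> agent) in which tasks held by
    offline agents are moved to agents that are both online and idle.

    Tasks stay where they are when no idle online agent is left.
    """
    free = [agent for agent in idle if agent in online]
    result = dict(assignments)
    for task, agent in assignments.items():
        if agent not in online and free:
            result[task] = free.pop(0)
    return result


assignments = {"part-1": "server", "part-2": "laptop"}
print(reassign_orphaned(assignments, online={"server", "desktop"}, idle=["desktop"]))
# {'part-1': 'server', 'part-2': 'desktop'}
```

Running this step whenever an agent is marked offline means the job is only delayed by the lost work, not blocked indefinitely.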

(This makes me wonder what happens when some agents are significantly more powerful than others. For example, one of my agents is a server in a rack with 40 CPU cores, while another is an Arm-powered MacBook: does the server race ahead of the others, but then sit idle until the others have finished their assigned tasks?)
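I do not know how BlendNet distributes work between unequal agents, but the concern can be illustrated with a small simulation of a pull model, in which whichever agent becomes free first takes the next task from a shared queue. The function name, agent names, and per-task timings below are invented for the example.

```python
import heapq


def simulate_pull(num_tasks, seconds_per_task):
    """Simulate agents pulling tasks from one shared queue.

    seconds_per_task maps agent name -> time that agent needs for one task.
    Returns (total time until the last task finishes, tasks done per agent).
    """
    # Min-heap of (time at which the agent becomes free, agent name).
    free_at = [(0.0, name) for name in sorted(seconds_per_task)]
    heapq.heapify(free_at)
    done = {name: 0 for name in seconds_per_task}
    finish = 0.0
    for _ in range(num_tasks):
        start, name = heapq.heappop(free_at)  # the agent that is free earliest
        end = start + seconds_per_task[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return finish, done


print(simulate_pull(12, {"server": 1.0, "laptop": 5.0}))
# (10.0, {'server': 10, 'laptop': 2})
```

With these made-up numbers, an even static split of 6 tasks each would keep the laptop busy for 6 × 5 = 30 time units while the server sits idle after 6, whereas the pull model finishes everything in 10.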

Screenshots

[Screenshot: CleanShot 2020-12-30 at 14 54 12@2x]

Additional context

Manager logs are full of:

WARN: Communication issue with request to "https://Roberts-Air.lan:9443/api/v1/status": [Errno 61] Connection refused

Note that if I bring the missing agent back, it will be assigned work and once that is complete the render job as a whole completes successfully.
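For reference, errno 61 is ECONNREFUSED on macOS (the same error is 111 on Linux), so a check for "agent looks offline" is better written against the symbolic constant than the number. The sketch below shows one way to classify such failures using only the Python standard library; `looks_offline` and the chosen set of error codes are assumptions for illustration and are not taken from BlendNet.

```python
import errno
import socket
import urllib.error

# Assumed set of socket errors that suggest the agent host is down or unreachable.
OFFLINE_ERRNOS = {
    errno.ECONNREFUSED,
    errno.EHOSTUNREACH,
    errno.ENETUNREACH,
    errno.ETIMEDOUT,
}


def looks_offline(exc):
    """Return True if `exc` suggests the agent is unreachable rather than
    merely answering with an application-level error."""
    # urllib wraps the underlying socket error in URLError.reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return True
    return isinstance(exc, OSError) and exc.errno in OFFLINE_ERRNOS


refused = OSError(errno.ECONNREFUSED, "Connection refused")
print(looks_offline(refused))                  # True
print(looks_offline(ValueError("bad reply")))  # False
```

A manager could feed the result of such a check into its scheduling decision, so that a refused connection leads to the work being handed to another agent rather than only to a warning in the log.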

@meermanr meermanr added the bug Something isn't working label Dec 30, 2020
@rabits rabits added this to the v0.4.0 milestone Dec 31, 2020
@rabits
Member

rabits commented Dec 31, 2020

Yep, the Manager should definitely know about that. I will look at the bug during v0.4.
