[BUG] All rendering stops if a single agent stops responding #134

meermanr · 2020-12-30T15:07:49Z

Issue description:

Last night I left BlendNet rendering out an animation and went to bed, and in the morning found all my equipment idle but lots of render jobs shown as PENDING in the job list.

It turns out I forgot to attach the charger to one of my laptops, so at some point in the night it turned off. Because my laptops are not hosted by a cloud provider, the agent did not receive a SIGTERM and so could not inform the Manager: it just stopped responding to the Manager's status requests.

It appears that the Manager does not give out any more work packages to agents until the "missing agent" responds. I would expect that the Manager would give work out to the remaining Agents and carry on getting work done.

Environment:

Application version: v0.3.5-4-g6f54c2f
Blender client version: 2.91.0
Blender worker version: 2.91.0
OS: Mac OS 11.1 (Addon & Agent) / Mac OS 10.15.7 (Manager & Agent) / Ubuntu 18.04 (Agent)

Steps to reproduce:

1. Run Blender
2. Add two or more agents to BlendNet, and ensure they are showing checks ✓
3. Kill one of the Agents (I pressed ^C twice in the terminal)
4. Open project, or save default scene to a file
5. Click "Run image task"
6. See that it enters RUNNING state, but never completes

In my case I started with four agents online, and took one offline before clicking "Run image task". The task seems to get stuck at 75% (1/4), so it appears the Manager has decided the missing agent should take this task, and until it does all other agents will be idle.

Expected behavior

The task should complete, because other agents are available.

I would expect the Manager to give work to any agent that is online and idle, i.e. if an agent goes offline (gracefully or otherwise) the other agents will do the work, eventually.
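To make the expected behavior concrete, here is a minimal Python sketch of the scheduling I have in mind. It is not BlendNet's actual code: `assign_tasks`, the `agents` mapping, and the `is_online` callback are hypothetical names used only for illustration. The idea is that work held by an unresponsive agent goes back into the queue, and new work is only handed to agents that are both online and idle.

```python
from collections import deque


def assign_tasks(pending, agents, is_online):
    """Hand out pending tasks, skipping agents that do not respond.

    pending   -- deque of task ids waiting to be rendered
    agents    -- dict: agent name -> task id currently assigned, or None if idle
    is_online -- callable(agent name) -> bool (hypothetical health check)
    """
    # Take work back from agents that have stopped responding.
    for name, task in agents.items():
        if task is not None and not is_online(name):
            pending.appendleft(task)  # requeue at the front for another agent
            agents[name] = None

    # Give work only to agents that are online and idle.
    for name, task in agents.items():
        if not pending:
            break
        if task is None and is_online(name):
            agents[name] = pending.popleft()

    return agents
```

If something like this ran on every scheduling pass, a task stuck on an offline agent would be picked up by the next idle agent instead of blocking the whole job.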

(This makes me wonder what happens when some agents are significantly more powerful than others. For example, one of my agents is a rack server with 40 CPU cores, while another is an Arm-powered MacBook: does the server race ahead of the others, then sit idle until the others have finished their assigned tasks?)

Screenshots

Additional context

Manager logs are full of:

WARN: Communication issue with request to "https://Roberts-Air.lan:9443/api/v1/status": [Errno 61] Connection refused

Note that if I bring the missing agent back, it will be assigned work and once that is complete the render job as a whole completes successfully.
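Since the failed status requests are already visible in the log, one possible way to detect the situation is to count consecutive failures per agent and treat an agent as offline once a threshold is reached. The sketch below is only an illustration under that assumption: the `AgentHealth` class and the `max_failures` parameter are hypothetical and not part of BlendNet.

```python
class AgentHealth:
    """Track consecutive failed status requests per agent (hypothetical helper)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # agent name -> consecutive failure count

    def record(self, agent, ok):
        """Record the outcome of one status request to the agent."""
        if ok:
            self.failures[agent] = 0  # any success resets the counter
        else:
            self.failures[agent] = self.failures.get(agent, 0) + 1

    def is_online(self, agent):
        """An agent counts as online until it fails max_failures times in a row."""
        return self.failures.get(agent, 0) < self.max_failures
```

Resetting the counter on any success means an agent that comes back is eligible for work again straight away, which matches what I see when I bring the missing agent back.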


rabits · 2020-12-31T23:22:51Z

Yep, the Manager should definitely know about that. I will look at the bug during v0.4.

meermanr added the "bug" label Dec 30, 2020

rabits added this to the v0.4.0 milestone Dec 31, 2020
