We're running torque 4.2.10 and we're seeing communication failures at sporadic intervals. The server log shows the following type of error:
02/28/2017 01:14:19;0004;PBS_Server.15205;Svr;authenticate_user;Hosts do not match: Requested host korf.nikhef.nl: credential host: stremsel.nikhef.nl
There is no rhyme or reason to the host names involved: they can be hosts from which jobs are submitted, the Torque server itself, or any one of the worker nodes.
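To check whether the host pairs really are random, something like the following could tally them from the server log. This is only a rough sketch: the regular expression is modelled on the single log line quoted above, and the function name `tally_mismatches` is made up for illustration. Other Torque versions may format the message differently.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the one log line quoted in this issue (an assumption,
# not taken from the Torque source).
MISMATCH_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});.*"
    r"Hosts do not match: Requested host (?P<requested>[^:\s]+): "
    r"credential host: (?P<credential>\S+)"
)


def tally_mismatches(lines):
    """Count how often each (requested host, credential host) pair occurs."""
    pairs = Counter()
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            pairs[(match.group("requested"), match.group("credential"))] += 1
    return pairs


if __name__ == "__main__":
    # Reads log lines from stdin and prints the most frequent pairs first.
    for (requested, credential), count in tally_mismatches(sys.stdin).most_common():
        print(f"{count:6d}  requested={requested}  credential={credential}")
```

Feeding it a server log on stdin prints one line per host pair, most frequent first. If a handful of hosts dominate the list, that would point at those machines (their clocks, for instance) instead of a general problem.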
We know that this error can be reproduced consistently when the clock on one of the nodes is wrong. Somehow the message is signed (by trqauthd?) with a timestamp, and a skewed clock causes a mismatch in the identity/credential checking. We have since made sure all our hosts are using NTP.
I have had a sidelong glance at the code where the checks are done, but I found the caching algorithm hard to understand.
I wasn't sure what you'd meant from what you said.
Is there a way that you can reproduce this? FWIW, I think it's very likely that upgrading will fix this issue, but I can't point to a specific changeset. It has been years since we've checked anything other than security fixes into the 4.2-dev tree.
It's not easy to reproduce because it's intermittent. However, I do see the same behaviour on our test bed, where I can debug without disrupting the production system.