Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
elastic-agent/internal/pkg/agent/cmd/run.go Lines 185 to 188 in fdacc6c
The current behavior seems to be that the agent process exits if delayed enrollment fails after a few retries. I think there is an assumption that the service manager will automatically restart the process so it can try again. If that isn't valid in all circumstances, this logic needs to change.
Shared this in Slack: It looks like the issue is due to manually starting the service, and if you rebooted you wouldn't see the issue? Per Craig's comment, the indefinite retry behavior is a combination of Elastic Agent retries and Windows service manager retries. As the Automatic startup mode from the Windows service manager only takes effect on next boot, not when you manually start the service, the retries are limited to whatever set of retries is baked into Agent. Can you confirm whether a reboot occurs between when the customer runs install with --delay-enroll and when they manually start the service?
Following up on the above: the same thing occurs after rebooting. The log indicates the same behavior: startup is attempted, but since the connection fails, Elastic Agent exits. This brings up a good question: when retries are working as expected, will they also stop after the "10 mins, every 1 min" noted in the logs? Are there any other options, for example to retry indefinitely and not rely on the user having to reboot or on some other 'watchdog' service?
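To make the "retry indefinitely" option concrete, here is a minimal sketch of a retry loop that can be either bounded or unbounded. It is not the Elastic Agent implementation: `retryEnroll`, `simulate`, and `errUnreachable` are hypothetical names, and the enroll step is a stand-in function rather than a real call to Fleet Server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnreachable is a hypothetical stand-in for a failed connection to Fleet Server.
var errUnreachable = errors.New("fleet server unreachable")

// retryEnroll calls enroll until it succeeds, ctx is cancelled, or
// maxAttempts calls have failed. maxAttempts <= 0 means retry forever.
func retryEnroll(ctx context.Context, maxAttempts int, delay time.Duration, enroll func() error) error {
	for attempt := 1; ; attempt++ {
		err := enroll()
		if err == nil {
			return nil
		}
		if maxAttempts > 0 && attempt >= maxAttempts {
			return fmt.Errorf("enrollment failed after %d attempts: %w", attempt, err)
		}
		// Wait before the next attempt, but stop early if the caller cancels.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
}

// simulate runs retryEnroll against an enroll step that fails the first
// `failures` times and then succeeds. It returns how often enroll was called.
func simulate(maxAttempts, failures int) (int, error) {
	calls := 0
	enroll := func() error {
		calls++
		if calls <= failures {
			return errUnreachable
		}
		return nil
	}
	err := retryEnroll(context.Background(), maxAttempts, time.Millisecond, enroll)
	return calls, err
}

func main() {
	calls, err := simulate(0, 3) // unbounded: keeps going until enroll succeeds
	fmt.Println("unbounded:", calls, err)

	calls, err = simulate(2, 3) // bounded: gives up after two failed attempts
	fmt.Println("bounded:", calls, err)
}
```

With an unbounded loop the process never exits on its own, so it no longer depends on the service manager to restart it. The trade-off is that the caller has to cancel the context to stop the retries, for example on a service stop request.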
Looks like the service fails to start so its restarts don't get managed by the service manager:
This appears to match @ajoliveira's read on the issue:
Interesting, I wonder at what point it considers us started enough to be restarted. We should be sending the StartPending and Running service control events independently of the delayed enrollment, but they are sent asynchronously from a goroutine and perhaps they aren't actually happening in time.
elastic-agent/internal/pkg/agent/cmd/run.go Lines 139 to 188 in 49745a7
CC @leehinman in case you know what the exact problem is off the top of your head.
Yeah, you have to get to https://github.com/elastic/elastic-agent-libs/blob/2c654640b6d541a1054930d956f23261dbe64288/service/service_windows.go#L48 and that is in another goroutine.
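The ordering problem described here can be sketched without any Windows APIs. The example below is an illustration only, not code from elastic-agent or elastic-agent-libs: the `status` type, `runService`, and `errEnrollFailed` are hypothetical, and plain channels stand in for the service manager. It assumes the fix is to block until Running has been reported before starting the enrollment step that may fail.

```go
package main

import (
	"errors"
	"fmt"
)

// status loosely mirrors Windows service states; the names are illustrative.
type status string

const (
	startPending status = "StartPending"
	running      status = "Running"
	stopped      status = "Stopped"
)

// errEnrollFailed is a hypothetical stand-in for a failed delayed enrollment.
var errEnrollFailed = errors.New("delayed enrollment failed")

// runService reports StartPending and Running from a separate goroutine and
// waits for both reports before calling enroll, so the recorded history always
// shows the service as started before it can stop.
func runService(enroll func() error) ([]status, error) {
	var history []status
	updates := make(chan status)
	done := make(chan struct{})

	// Stand-in for the service manager: records every status it receives.
	go func() {
		defer close(done)
		for s := range updates {
			history = append(history, s)
		}
	}()

	// Stand-in for the service handler goroutine.
	started := make(chan struct{})
	go func() {
		updates <- startPending
		updates <- running
		close(started)
	}()

	<-started // do not start enrollment until Running has been reported
	err := enroll()

	updates <- stopped
	close(updates)
	<-done // history is only read after the recorder has finished
	return history, err
}

func main() {
	history, err := runService(func() error { return errEnrollFailed })
	fmt.Println(history, err)
}
```

Without the wait on `started`, a fast enrollment failure could let the process stop before Running was ever reported, which is the race suspected above.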
After installing Elastic Agent with delayed enrollment, when the service is (re)started and fails to enroll, the service simply exits and does not continue to retry automatically. The Services panel message is:
"The Elastic Agent service on Local Computer started and then stopped. Some services stop automatically if they are not in use by other services or programs."
The 2 scenarios I used to test this were:
Both log outputs indicate that a delayed enrollment is attempted, that the first enrollment attempt failed, and that it would be retried for 10 mins, every 1 min, against the specified Fleet Server URL, but the process/service just exits after that.
For confirmed bugs, please report:
--delay-enroll option
Step 5 always fails since the service/process exits immediately after reporting a connectivity issue.