Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716

Closed
ajoliveira opened this issue May 8, 2024 · 8 comments · Fixed by #4727
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@ajoliveira

After installing Elastic Agent with delayed enrollment, when the service is (re)started and fails to enroll, the service simply exits and does not continue to retry automatically. The Services panel message is:
"The Elastic Agent service on Local Computer started and then stopped. Some services stop automatically if they are not in use by other services or programs."

The 2 scenarios I used to test this were:

  • no network (i.e. Wi-Fi disabled; the error is a DNS failure)
  • with network and wrong DNS (failed connection to Fleet Server)

Both log outputs indicate that the agent attempts a delayed enrollment, that the first enrollment attempt failed, and that it would be retried every 1 minute for 10 minutes against the specified Fleet Server URL. However, the process/service simply exits after that.

For confirmed bugs, please report:

  • Version: 8.13.3
  • Operating System: Windows 10 Home
  • Discuss Forum URL:
  • Steps to Reproduce:
  1. Install Elastic Agent with --delay-enroll option
  2. Confirm service is created with 'Automatic' startup
  3. Disable network or override DNS resolution to replicate a failed connection
  4. Attempt to start the service manually
  5. Confirm service/process continues to run and retry enrollment

Step 5 always fails since the service/process exits immediately after reporting a connectivity issue.

@ajoliveira ajoliveira added the bug Something isn't working label May 8, 2024
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 8, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented May 8, 2024

cfg, err = tryDelayEnroll(ctx, l, cfg, override)
if err != nil {
	return logReturn(l, errors.New(err, "failed to perform delayed enrollment"))
}

The current behavior seems to be that the agent process exits if delayed enrollment fails after a few retries. I think there is an assumption that the service manager is going to automatically restart the process so it can try again. If that isn't valid in all circumstances this logic needs to change.

@strawgate

strawgate commented May 8, 2024

Shared this in Slack:

It looks like the issue is due to the manual starting of the service, and if you rebooted you wouldn't see the issue?

Per Craig's comment, the indefinite retry behavior is a combination of Elastic Agent retries and Windows service manager retries.

As the Automatic startup mode from the Windows service manager only takes effect on next boot, not when you manually start the service, the retries are limited to whatever set of retries are baked into Agent.

Can you confirm whether a reboot occurs between when the customer runs install with --delay-enroll and when they manually start the service?

@ajoliveira
Author

Following up on the above: the same thing occurs after rebooting. The log shows the same behavior: startup is attempted, but since the connection fails, Elastic Agent exits.

This brings up a good question - when retries are working as expected, will they also stop after the "10 mins, every 1 min" noted in the logs? Are there any other options, for example, to retry indefinitely rather than relying on the user to reboot or on some other 'watchdog' service?

@strawgate

strawgate commented May 8, 2024

It looks like the service fails to start, so its restarts are not managed by the service manager:

net start "Elastic Agent"
The Elastic Agent service is starting.
The Elastic Agent service could not be started.

The service did not report an error.

More help is available by typing NET HELPMSG 3534.

PS C:\Users\bill_easton\elastic-agent-8.12.2-windows-x86_64>

This appears to match @ajoliveira's read on the issue:

Wondering if this explains what's going on - the service exits before it is considered to be 'running' and thus the Recovery/Retries are never attempted? And looks like we expect that to be the case that the service manager would handle this via this PR?

@cmacknz
Member

cmacknz commented May 9, 2024

Interesting, I wonder at what point it considers us started enough to be restarted.

We should be sending the StartPending and Running service control events concurrently with the delayed enrollment, but they are sent from a separate goroutine, so perhaps they aren't actually happening in time.

	defer cancel()
	go service.ProcessWindowsControlEvents(stopBeat)
	return runElasticAgent(ctx, cancel, override, stop, testingMode, fleetInitTimeout, false, nil, modifiers...)
}

func logReturn(l *logger.Logger, err error) error {
	if err != nil && !errors.Is(err, context.Canceled) {
		l.Errorf("%s", err)
	}
	return err
}

func runElasticAgent(ctx context.Context, cancel context.CancelFunc, override cfgOverrider, stop chan bool, testingMode bool, fleetInitTimeout time.Duration, runAsOtel bool, awaiters awaiters, modifiers ...component.PlatformModifier) error {
	cfg, err := loadConfig(ctx, override, runAsOtel)
	if err != nil {
		return err
	}
	logLvl := logger.DefaultLogLevel
	if cfg.Settings.LoggingConfig != nil {
		logLvl = cfg.Settings.LoggingConfig.Level
	}
	baseLogger, err := logger.NewFromConfig("", cfg.Settings.LoggingConfig, true)
	if err != nil {
		return err
	}
	// Make sure to flush any buffered logs before we're done.
	defer baseLogger.Sync() //nolint:errcheck // flushing buffered logs is best effort.
	l := baseLogger.With("log", map[string]interface{}{
		"source": agentName,
	})
	// try early to check if running as root
	isRoot, err := utils.HasRoot()
	if err != nil {
		return logReturn(l, fmt.Errorf("failed to check for root/Administrator privileges: %w", err))
	}
	l.Infow("Elastic Agent started",
		"process.pid", os.Getpid(),
		"agent.version", version.GetAgentPackageVersion(),
		"agent.unprivileged", !isRoot)
	cfg, err = tryDelayEnroll(ctx, l, cfg, override)
	if err != nil {
		return logReturn(l, errors.New(err, "failed to perform delayed enrollment"))
	}

@cmacknz
Member

cmacknz commented May 9, 2024

CC @leehinman in case you know what the exact problem is off the top of your head.

@leehinman
Contributor

Interesting, I wonder at what point it considers us started enough to be restarted.

Yeah, you have to get to https://github.com/elastic/elastic-agent-libs/blob/2c654640b6d541a1054930d956f23261dbe64288/service/service_windows.go#L48, and that is in another goroutine.


7 participants