Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stub.go calls os.Exit(0) on lost connection #74

Open
champtar opened this issue Mar 12, 2024 · 4 comments
Open

stub.go calls os.Exit(0) on lost connection #74

champtar opened this issue Mar 12, 2024 · 4 comments

Comments

@champtar
Copy link
Contributor

If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin os.Exit(0) without a chance to log anything

nri/pkg/stub/stub.go

Lines 519 to 528 in 53d3371

// Handle a lost connection.
func (stub *stub) connClosed() {
stub.close()
if stub.onClose != nil {
stub.onClose()
return
}
os.Exit(0)
}

@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()

Also stub.Run hides ttrpc.ErrServerClosed

nri/pkg/stub/stub.go

Lines 436 to 450 in 53d3371

// Run the plugin. Start event processing then wait for an error or getting stopped.
func (stub *stub) Run(ctx context.Context) error {
var err error
if err = stub.Start(ctx); err != nil {
return err
}
err = <-stub.srvErrC
if err == ttrpc.ErrServerClosed {
return nil
}
return err
}

Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?

@klihub
Copy link
Member

klihub commented Mar 12, 2024

If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin os.Exit(0) without a chance to log anything

Yes, but only if the plugin does not have an onClose()-handler.

nri/pkg/stub/stub.go

Lines 519 to 528 in 53d3371

// Handle a lost connection.
func (stub *stub) connClosed() {
stub.close()
if stub.onClose != nil {
stub.onClose()
return
}
os.Exit(0)
}

@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()

The logic was roughly: If the plugin has an onClose()-handler it wants to get notified about the lost connection, so we gather it does know how to and wants to deal with the situation on its own. But if the plugin ignores lost connection (does not have an onClose()-handler), we do an os.Exit().

Also stub.Run hides ttrpc.ErrServerClosed

nri/pkg/stub/stub.go

Lines 436 to 450 in 53d3371

// Run the plugin. Start event processing then wait for an error or getting stopped.
func (stub *stub) Run(ctx context.Context) error {
var err error
if err = stub.Start(ctx); err != nil {
return err
}
err = <-stub.srvErrC
if err == ttrpc.ErrServerClosed {
return nil
}
return err
}

Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?

Hmm... I think we ignore ttrpc.ErrServerClosed as a cheap way of differentiating between a failed Start() and a successful run terminated by the runtime being shut down. But maybe we should do it differently, for instance by defining in stub a ErrFailedToStart = errors.New("failed to start") then always return a fmt.Errorf("%w: %w", ErrFailedToStart, err) for all errors from Start() and return any other/later errors as such. Then one could differentiate between startup- and other runtime errors by checking it with errors.Is(err, stub.ErrFailedToStart), if that is really necessary.

@champtar
Copy link
Contributor Author

The problem here is that you/the current code assumes lost connection is a normal shutdown.
For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.

@klihub
Copy link
Member

klihub commented Mar 17, 2024

The problem here is that you/the current code assumes lost connection is a normal shutdown. For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.

But is the Exit(0)-logic really a problem ? If you don't want the stub to exit your plugin when the connection is lost, yet you are not interested in the fact itself, you can simply instantiate the stub with an extra stub.WithOnClose(func (){}) option.

Unless your plugin is the kind which does unsolicited updates, if you don't register an OnClose-handler, you will never realize that you have lost the connection. So IMO, the least we must do in the stub is to exit runtime-launched plugins when the connection is lost since those cannot re-establish their connection. Instead of that, now we exit all plugins which have no OnClose handler.

@klihub
Copy link
Member

klihub commented Mar 17, 2024

About swallowing ErrServerClosed in Run() once the stub has been successfully Start()ed, I think it should be safe to get rid of that. Just need to check if I rely on that behavior in some tests or integration tests and fix them if I do...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants