Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch termination signal #62

Conversation

martinkennelly
Copy link
Member

In order for termination signal to be captured, two goroutines
are created which handle TLS keypair reloading and HTTP server
instantiation.

Signed-off-by: Kennelly, Martin [email protected]

Fixes: #48

@martinkennelly martinkennelly changed the title Catch termination signal WIP: Catch termination signal Jan 25, 2021
@martinkennelly
Copy link
Member Author

I want to update this patch to make it easier to test.
I will add test cases.

@martinkennelly martinkennelly changed the title WIP: Catch termination signal Catch termination signal Feb 16, 2021
In order to catch termination signals, changes to how we handle
cert/key watcher & HTTP server are needed.
Refactor code to use goroutines for TLS cert/key watcher and
HTTP server.
Add Channel to safely manage signals from goroutines.
Add interfaces to aid testing with mockery.
Add tests to cover new code changes.

Signed-off-by: Kennelly, Martin <[email protected]>
@martinkennelly
Copy link
Member Author

A lot of changes here I know. I improved the test coverage to 33%. I am hoping to add e2e tests in a separate PR using kind and this will help validate this patch.

@martinkennelly
Copy link
Member Author

martinkennelly commented Apr 10, 2021

@zshi-redhat Do you see value with this patch? I was going to push for it once E2E test cases got to a sufficient level where we can do regression testing of the features. @MichalGuzieniuk is working on adding these right now.

I want this patch for the following reasons but it will need to be reworked to accommodate your user defined injections work but that is no biggie once we have tests for it.

Reasons:

  • NRI will catch term signals - currently it doesn't handle them.
  • NRI will end if HTTP server fails - currently nothing happens if it fails.
  • NRI will end if keypair reloader fails - current we just keep retrying.
  • Improved unit test coverage from ~20% to 33%
  • Improved separation of code instead of it all bundled into webhook.go. We need to stop that going forward imo.

I propose adding user injections golang routine as well and if it fails, NRI should end, and be re-provisioned.

Anyway, Its up to you if you see any value with this patch long term. Let me know when you have time. No rush as I think its good to fix the existing problem with user injection first, and increase the e2e test cases (thanks to @MichalGuzieniuk ) to ensure we have regression testing when this major change occurs.

@zshi-redhat
Copy link
Collaborator

@zshi-redhat Do you see value with this patch? I was going to push for it once E2E test cases got to a sufficient level where we can do regression testing of the features. @MichalGuzieniuk is working on adding these right now.

+1, yes, absolutely.

I want this patch for the following reasons but it will need to be reworked to accommodate your user defined injections work but that is no biggie once we have tests for it.

Reasons:

* NRI will catch term signals - currently it doesn't handle them.

* NRI will end if HTTP server fails - currently nothing happens if it fails.

* NRI will end if keypair reloader fails - current we just keep retrying.

* Improved unit test coverage from ~20% to 33%

* Improved separation of code instead of it all bundled into webhook.go. We need to stop that going forward imo.

I think these are positive moves.
I had the same idea of separating giant webhook.go into multiple files, didn't get chance to implement it :-)

I propose adding user injections golang routine as well and if it fails, NRI should end, and be re-provisioned.

User-defined-injections is an optional feature for now, We probably don't want to end the NRI process for the absense of configMap or so.

Anyway, Its up to you if you see any value with this patch long term. Let me know when you have time. No rush as I think its good to fix the existing problem with user injection first, and increase the e2e test cases (thanks to @MichalGuzieniuk ) to ensure we have regression testing when this major change occurs.

Thanks for working on these improvements!

if c.IsClosed() {
return nil
}
time.Sleep(interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we derive sleep interval from limit ?


const (
interval = time.Millisecond * 2
chBuffer = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move buffer size into Channel and configurable from NewChannel API?

@@ -0,0 +1,92 @@
package webhook
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this into another package pkg/tools/channel.go ?

Copy link
Member Author

@martinkennelly martinkennelly Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in golang, its an antipattern to put funcs in packages named "tools" or "utils".
"Name your packages after what they provide, not what they contain."
However, having everything in package webhook is not good either and something I want to improve. I will move the channel stuff into a channel package. Thanks @pperiyasamy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good info! I will follow this convention ;)

@@ -832,3 +835,44 @@ func SetInjectHugepageDownApi(hugepageFlag bool) {
func SetHonorExistingResources(resourcesHonorFlag bool) {
honorExistingResources = resourcesHonorFlag
}

//Watch blocks until either TLS cert & key watcher or HTTP server or termination signal generated
func Watch(server nri.Service, watcher nri.Service, term chan os.Signal) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this method more generic by making nri.Service as var-args ?
this would also enable nad cache service for handling termination signal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperiyasamy Good idea but I don't know how I could use select to watch all the service signals. If you can think of a pattern let me know.

Copy link
Contributor

@pperiyasamy pperiyasamy Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we try something like this (not tested, though) ?

	ch := make(chan int)
	for idx := range services {
		go func(serv Service, servIdx int) {
			<-serv.StatusSignal()
			for otherSrvIdx, otherService := range services {
				if otherSrvIdx == servIdx {
					continue
				}
				otherService.Quit()
			}
			close(ch)
		}(services[idx], idx)
	}
	<-ch
	return

case <-watcher.StatusSignal(): // when TLS cert & key watcher finishes unexpectedly
glog.Error("TLS key & cert watcher ended")
err = server.Quit()
case <-server.StatusSignal(): // when HTTP server finishes unexpectedly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a question: when write happens on this server channel ? i couldn't find it anywhere in the code :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperiyasamy see server.go: within method with pointer receiver Run() -> defer mSrv.Status.Close(). When the channel closes:

After calling close, and after any previously sent values have been received, receive operations will return the zero value for the channel's type without blocking.

ref: https://golang.org/ref/spec#Close

.. this causes this case to trigger

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah! ok, that means server.go'smSrv.instance.Start()is blocking call, right ?

@zshi-redhat
Copy link
Collaborator

@martinkennelly Would you mind rebase this PR? Thank you!

@martinkennelly
Copy link
Member Author

Closing this as I don't have time to work on it. Anyone feel free to carry it on.

@MichalGuzieniuk
Copy link
Contributor

@martinkennelly if you don't mind I will take codebase done by you in this PR, make the rebase and I will continue work on it.

@martinkennelly
Copy link
Member Author

martinkennelly commented Sep 28, 2021

@MichalGuzieniuk Don't mind at all. I got stuck continuously refactoring NRI with this PR. Here is the outcome: https://github.com/martinkennelly/network-resources-injector/commits/refactor_webhook

The refactor went too far and I was concerned with that... I hope you can do a better job limiting the refactor scope than I did..
I think that refactor is good long term, just the scope of changes...

@martinkennelly
Copy link
Member Author

I converted everything into services which could be watched, killed. I separated out all the code into pkgs as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NRI doesn't catch termination signals and end
4 participants