Issue339 try4 part1 fail connections in main loop #1247
Closed
Replaces #1244 by cherry-picking just the relevant commits onto an up-to-date development tree. #1244 was getting very confusing, with more than 20 commits mixing the two branches. With this restart, it's only a handful of commits, which is much easier to review.
Goal: if we have multiple broker connections, then we can't loop waiting to connect to each one, because every time one broker connection goes down, the code will just loop there.
Instead, the connection routines make a single attempt and return, and the main loop then calls them again.
This is also consistent with the pattern of trying to have a single place where the code sleeps.
The only changes in this code are removing the loop (which outdented the entire routine by 4... sigh...), changing "break" to "return" in a few places, and removing the sleeps.
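A minimal sketch of that pattern, not the code in this PR: each connection routine makes one attempt and returns, and the main loop owns the only sleep. The names `BrokerConnection`, `try_connect`, `run_once`, and `main_loop` are hypothetical and exist only for illustration.

```python
import time


class BrokerConnection:
    """Hypothetical stand-in for one broker connection."""

    def __init__(self, name, connect_fn):
        self.name = name
        self._connect_fn = connect_fn  # callable that raises OSError on failure
        self.connected = False

    def try_connect(self):
        """Make a single attempt and return: no retry loop, no sleep."""
        try:
            self._connect_fn()
            self.connected = True
        except OSError:
            self.connected = False
        return self.connected


def run_once(connections):
    """One pass of the main loop: retry whatever is down, report what is up."""
    for conn in connections:
        if not conn.connected:
            conn.try_connect()
    return [conn.name for conn in connections if conn.connected]


def main_loop(connections, passes, sleep_seconds=1.0, sleep=time.sleep):
    """Call run_once repeatedly; this is the only place that sleeps."""
    for _ in range(passes):
        run_once(connections)
        sleep(sleep_seconds)
```

With this shape, a broker that stays down costs one failed attempt per pass instead of trapping the process in a retry loop, so the healthy connections still get serviced.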
So that's the first few patches. Later it dawned on me that even if you loop from the main line, when a connection attempt hangs, you still spend all your time looping and hanging. To free up some time (so other connections can be used), I added exponential backoff to connection attempts. That's the rest of the commits. The code looks at how long it took for the connection to fail, and then backs off for at least twice that long.
Note that when failures are quick, they can take variable amounts of time, so the backoff might not increase continuously: the backoff period is a multiple of how long the attempt took to fail, so if an attempt fails faster, the next retry may come sooner.
You get an odd pattern of retries that generally does the right thing, but looks wrong.
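To make the retry pattern concrete, here is a rough sketch of a backoff whose length depends on how long the failed attempt took. It is an illustration under assumptions, not the code in these commits: `ConnectBackoff`, `ready`, `attempt`, and the `factor` parameter are hypothetical names, and the factor of 2 simply mirrors the "at least twice that long" rule described above.

```python
import time


class ConnectBackoff:
    """Hypothetical helper: back off in proportion to how long a failure took."""

    def __init__(self, factor=2.0, clock=time.monotonic):
        self.factor = factor  # assumed multiplier, taken from "twice that long"
        self.clock = clock
        self.next_try = 0.0   # earliest clock value at which to retry

    def ready(self):
        """True once the backoff period has elapsed."""
        return self.clock() >= self.next_try

    def attempt(self, connect_fn):
        """Time a single attempt; on failure, schedule the next retry."""
        start = self.clock()
        try:
            connect_fn()
        except OSError:
            elapsed = self.clock() - start
            # Wait a multiple of the time the failed attempt consumed.
            self.next_try = self.clock() + self.factor * elapsed
            return False
        self.next_try = 0.0
        return True
```

Because the wait is derived from the most recent failure rather than from a counter, an attempt that hangs for 3 seconds schedules a 6 second wait, while a later attempt that fails in 0.5 seconds schedules only a 1 second wait. That is the uneven retry pattern described above.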