
Zero retryLimit Support in ReplayClient #442

Merged
@ezeike merged 32 commits into main from zero-retrylimit-replayclient on May 8, 2024

Conversation

@ezeike (Contributor) commented Mar 19, 2024

Summary

Enables the replay client to continuously attempt to reconnect to the pocket node.

Issue

The current implementation of the replay client panics after 10 failed connection attempts.
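
For illustration only, a minimal sketch of the zero-means-unlimited convention the title refers to; the names below are hypothetical and not taken from the actual diff:

package retry

// shouldGiveUp is a hypothetical sketch of the zero-means-unlimited reading of
// the PR title (illustrative names, not the actual diff): a retryLimit of 0
// disables the limit, so the replay client keeps trying to reconnect to the
// pocket node instead of panicking after a fixed number of attempts.
func shouldGiveUp(retryCount, retryLimit uint64) bool {
	return retryLimit > 0 && retryCount >= retryLimit
}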

Type of change

Select one or more:

  • New feature, functionality or library
  • Bug fix
  • Code health or cleanup
  • Documentation
  • Other (specify)

Testing

  • Unit Tests: make go_develop_and_test
  • LocalNet E2E Tests: make test_e2e
  • DevNet E2E Tests: Add the devnet-test-e2e label to the PR. THIS IS VERY EXPENSIVE, so only do it after all the reviews are complete.
  • Documentation changes: make docusaurus_start

Sanity Checklist

  • I have tested my changes using the available tooling
  • I have commented my code
  • I have performed a self-review of my own code; both comments & source code
  • I have created and referenced any new tickets, if applicable
  • I have left TODOs throughout the codebase, if applicable

@ezeike added the enhancement (New feature or request) label on Mar 19, 2024
@ezeike added this to the Shannon TestNet milestone on Mar 19, 2024
@ezeike requested a review from red-0ne on March 19, 2024 19:10
@ezeike self-assigned this on Mar 19, 2024
@ezeike force-pushed the zero-retrylimit-replayclient branch from 077fe70 to 7fd7a33 on March 19, 2024 19:12
@h5law (Contributor) left a comment


LGTM! Just want you to add a TODO, then good to go!

Review thread on pkg/client/events/replay_client.go (outdated, resolved)
@h5law (Contributor) previously requested changes on Mar 19, 2024


After a second look at

if publishError != nil {
you need to change the logic in the retry.OnError method as well as this, or you will panic if the connection ever closes and needs to be reestablished.

@Olshansk requested review from bryanchriswhite and removed the request for red-0ne on March 20, 2024 03:06
@Olshansk (Member) left a comment


@bryanchriswhite Added you as a reviewer instead of @red-0ne since you have more context on the replay client and I wanted to get your 👀 on this.

Review thread on pkg/retry/retry.go (outdated, resolved)
@ezeike force-pushed the zero-retrylimit-replayclient branch from 3ec9494 to 126353d on March 20, 2024 16:14
@ezeike (Contributor, Author) commented Mar 20, 2024

Some questions before more changes are made:

  • Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?
  • Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

@Olshansk (Member) commented

@bryanchriswhite Can you PTAL at @ezeike's questions before you go OOO?

@h5law (Contributor) commented Mar 21, 2024

@ezeike

Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?

The errCh in retry.OnError closes after we have reached the max number of retry attempts. retry.OnError essentially takes a work function that returns an error channel and runs it until it errors, then retries a certain number of times. See [1] for the work function that is supplied by the replay client. The two instances where it returns an error on the channel (causing a retry in retry.OnError) are:

  1. We were unable to subscribe to the events query client for some reason.
  2. The events query client closed for some reason.

These may be due to network issues or some other unexpected error, but they are natural possibilities. If either of these does happen we are already treating it as an error, and using a channel makes sense in order for the retry logic to work.

I am a little unsure what you mean by retry in the same way but treat them as errors - it sounds like that would just be the same thing to me?

Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

The retryLimit is essentially there to stop an infinite loop from occurring if, say, the connection to the full node providing the on-chain data to a gateway is faulty or the node goes offline.

By setting the eventsBytesRetryLimit to ^uint64(0) and removing the if retryLimit > 0 check you will get the same result - (essentially) infinite retries. As soon as you have had a successful connection to an events query client for 10 seconds, your retryCount resets, so you can again fail and retry to connect over and over.

@ezeike Are you finding that your connections are failing a lot and you are reaching the 10 retry limit often? If so I think investigating what is causing that could help us figure out a better solution than an infinite loop.

[1]

func (rClient *replayClient[T]) retryPublishEventsFactory(ctx context.Context) func() chan error {
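
To make the mechanism above concrete, here is a self-contained, hypothetical sketch of an OnError-style helper under the same assumptions (a work function returning an error channel, a delay between attempts, a count that resets after a sustained healthy connection, and zero meaning unlimited). It is illustrative only and not the actual pkg/retry implementation:

package retry

import (
	"context"
	"errors"
	"time"
)

// onErrorSketch retries workFn until retryLimit consecutive failures occur.
// A received error and a closed error channel are both treated as failures.
func onErrorSketch(
	ctx context.Context,
	retryLimit uint64,
	retryDelay time.Duration,
	retryResetTimeout time.Duration,
	workFn func() chan error,
) error {
	var retryCount uint64
	for {
		errCh := workFn()
		startedAt := time.Now()

		select {
		case <-ctx.Done():
			return ctx.Err()
		case workErr, open := <-errCh:
			// If the connection stayed healthy long enough, start counting again.
			if time.Since(startedAt) > retryResetTimeout {
				retryCount = 0
			}
			retryCount++
			// A retryLimit of 0 means never give up.
			if retryLimit > 0 && retryCount >= retryLimit {
				if !open {
					return errors.New("work error channel closed: retry limit reached")
				}
				return workErr
			}
			time.Sleep(retryDelay)
		}
	}
}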

@ezeike (Contributor, Author) commented Mar 21, 2024

@ezeike

Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?

The errCh in retry.OnError closes after we have reached the max number of retry attempts. retry.OnError essentially takes a work function that returns an error channel and runs it until it errors, then retries a certain number of times. See [1] for the work function that is supplied by the replay client. The two instances where it returns an error on the channel (causing a retry in retry.OnError) are:

  1. We were unable to subscribe to the events query client for some reason.
  2. The events query client closed for some reason.

These may be due to network issues or some other unexpected error, but they are natural possibilities. If either of these does happen we are already treating it as an error, and using a channel makes sense in order for the retry logic to work.

I am a little unsure what you mean by retry in the same way but treat them as errors - it sounds like that would just be the same thing to me?

Errors are received through the error channel and those trigger a Sleep and a retry. However, when the error channel is closed, OnError returns and the SDK panics. A first thought was just to eliminate that return so execution continues on to the same Sleep-and-retry logic; workFn() is called every loop and errCh is created anew anyway.

The current design is that the SDK takes care of connection details and does its best to ensure there's always a connection to a node. We should keep this as much as possible so that using the SDK stays as simple as we can make it.

Either:

  • The SDK is modified to remove the panic and return an error to the gateway when the retry limit is reached. Then the gateway can make a new SDK and try again.
  • The SDK is modified to allow unlimited retries.

At the moment we should pick the quickest one that solves the issue so we can get our gateway on testnet as soon as possible. The second option was quicker so I went with that.

There are other discussions to have around who reconnects, how to support using multiple nodes, and how configurable the SDK should be, but for now we want an MVP ASAP.

Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

The retryLimit is essentially there to stop an infinite loop from occurring if, say, the connection to the full node providing the on-chain data to a gateway is faulty or the node goes offline.

From a gateway perspective we want an infinite loop. The SDK only allows specifying a single node, and it should do its best to always be connected to that node.

The infinite loop doesn't have to exist in OnError, though; we could do it in goPublishEvents in this case.
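
For illustration, a minimal sketch of that option, with a caller-supplied subscribe function standing in for the replay client's internals (hypothetical names, not the actual goPublishEvents):

package events

import (
	"context"
	"log"
	"time"
)

// publishEventsForever re-subscribes indefinitely: every error (or closed
// channel) from the subscription is logged and followed by a short delay
// before trying again, until the context is canceled.
func publishEventsForever(
	ctx context.Context,
	reconnectDelay time.Duration,
	subscribe func(context.Context) chan error,
) {
	for {
		errCh := subscribe(ctx)
		select {
		case <-ctx.Done():
			return
		case subErr := <-errCh:
			// subErr is nil if the channel was simply closed; either way, reconnect.
			log.Printf("events subscription ended (err=%v); reconnecting", subErr)
		}
		time.Sleep(reconnectDelay)
	}
}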

If we don't have an infinite loop the gateway will have to handle connection failures and re-create the SDK. That's reasonable too, but shifts some work to gateway devs.

By setting the eventsBytesRetryLimit to ^uint64(0) and removing the if retryLimit > 0 check you will get the same result - (essentially) infinite retries. As soon as you have had a successful connection to an events query client for 10 seconds, your retryCount resets, so you can again fail and retry to connect over and over.

If we do this, retryCount >= retryLimit will be true on the first pass and it will return. We'd need another tweak.

@ezeike Are you finding that your connections are failing a lot and you are reaching the 10 retry limit often? If so I think investigating what is causing that could help us figure out a better solution than an infinite loop.

I saw the error when I didn't have localnet running. Moving forward, though, we can't have the SDK panic when a node connection is lost or not available.

@h5law (Contributor) commented Mar 21, 2024

I'll try to keep this one shorter, @ezeike.

Errors are received through the error channel and those trigger a Sleep and a retry. However, when the error channel is closed, OnError returns and the SDK panics. A first thought was just to eliminate that return so execution continues on to the same Sleep-and-retry logic; workFn() is called every loop and errCh is created anew anyway.
The current design is that the SDK takes care of connection details and does its best to ensure there's always a connection to a node. We should keep this as much as possible so that using the SDK stays as simple as we can make it.
Either:
The SDK is modified to remove the panic and return an error to the gateway when the retry limit is reached. Then the gateway can make a new SDK and try again.
The SDK is modified to allow unlimited retries.
At the moment we should pick the quickest one that solves the issue so we can get our gateway on testnet as soon as possible. The second option was quicker so I went with that.
There are other discussions to have around who reconnects, how to support using multiple nodes, and how configurable the SDK should be, but for now we want an MVP ASAP.

Thoughts about adding this

defer func() {
	// recover() only has an effect inside a deferred function and returns an
	// interface value rather than an error, hence fmt.Sprintf below (requires
	// the "fmt" and "strings" imports).
	if r := recover(); r != nil {
		errMsg := fmt.Sprintf("%v", r)
		// You could be extra verbose by checking for more parts of the panic's error
		// message by including the retry limit and error from the retry.OnError, but
		// this is the error message given to the panic we are talking about.
		// See: ./pkg/client/events/replay_client.go#L199
		if strings.Contains(errMsg, ".goPublishEvents should never reach this spot:") {
			restartSDK() // or however you do this; I am unfamiliar but the rest is correct
		} else {
			panic(r)
		}
	}
}()

This means we can still have the limit - it fixes your problem of needing an infinite loop and is a very minimal code change, quite simple in fact. Placing this somewhere appropriate in the SDK will be pretty easy IMO; @red-0ne and you @ezeike have more context here than I do.

If we do this, retryCount >= retryLimit will be true on the first pass and it will return. We'd need another tweak.

If we don't go with the above solution, we will count down instead of counting up, if that makes sense: since we will use the max uint64 value, counting down is the only real option - again, minimal changes.
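
A hypothetical sketch of that count-down variant, illustrative rather than the actual pkg/retry code:

package retry

import (
	"errors"
	"time"
)

// onErrorCountdown starts from max uint64 (effectively unlimited retries) and
// decrements on every failed attempt instead of counting up toward a limit.
func onErrorCountdown(retryDelay time.Duration, workFn func() chan error) error {
	retriesLeft := ^uint64(0) // max uint64
	for {
		workErr, open := <-workFn() // an error or a closed channel both count as a failure
		if retriesLeft == 0 {
			if !open {
				return errors.New("retry limit reached: work error channel closed")
			}
			return workErr
		}
		retriesLeft--
		time.Sleep(retryDelay)
	}
}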

@ezeike force-pushed the zero-retrylimit-replayclient branch 5 times, most recently from f57bd58 to 99cfe9a, on March 27, 2024 14:43
@ezeike (Contributor, Author) commented Mar 27, 2024

IMO the proper approach here is to refactor the replay client to remove the panics that are present in the package.

As a consequence there is now no way to limit the number of connection attempts the replay client makes. We need some feedback mechanism before we allow limiting reconnection attempts.

Ideally we would give the replay client an error channel to signal to the caller that it is giving up, but that is a larger PR and would require more complexity in our gateway.

We (the backend team) have decided we'd like to remove the panics and move forward with testing as-is. We can re-evaluate the error channel at a later date if the need becomes apparent.

If we were to implement an error channel (and a config option for retryLimit), we'd need to modify these three clients:

BlockClient, DelegationClient, TxClient

as well as their associated supplier functions, and both gateways (appgate and grove gateway) to handle the config and extra state management.
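
As a rough, hypothetical illustration of the deferred error-channel idea (names invented for the sketch; nothing here is implemented in this PR), the replay client could expose a channel that the gateway watches instead of relying on a panic:

package events

// failureSignaler is a hypothetical sketch of the deferred design: the client
// reports a terminal reconnection failure to its caller instead of panicking.
type failureSignaler struct {
	// Failed receives at most one error, signalling that the client gave up.
	Failed chan error
}

func newFailureSignaler() *failureSignaler {
	return &failureSignaler{Failed: make(chan error, 1)}
}

// giveUp is called by the client when the (hypothetical) retry limit is hit.
func (s *failureSignaler) giveUp(err error) {
	select {
	case s.Failed <- err:
	default: // a failure was already reported
	}
}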

@ezeike requested review from h5law and Olshansk on March 27, 2024 15:17
@ezeike force-pushed the zero-retrylimit-replayclient branch from 99cfe9a to 396f4c3 on March 29, 2024 14:19
Review threads on pkg/client/events/replay_client.go (3 outdated, resolved)
@bryanchriswhite force-pushed the zero-retrylimit-replayclient branch 2 times, most recently from 5b99b80 to 0d25513, on April 3, 2024 10:15
@Olshansk added the off-chain (Off-chain business logic) label on Apr 5, 2024
Review thread on pkg/retry/retry_test.go (outdated, resolved)
@Olshansk (Member) left a comment


@ezeike

  1. Please merge with main
  2. See bryan's comment
  3. Make sure E2E tests pass on DevNet

Will approve after ☝️

@ezeike force-pushed the zero-retrylimit-replayclient branch from 8b008c7 to 6680149 on April 25, 2024 21:13
bryanchriswhite and others added 5 commits April 26, 2024 09:08
…lient

* pokt/main:
  [LocalNet][Hotfix] Turn off pproff by default (#500)
  [Tooling] Add pprof endpoints and documentation (#484)
  [Documentation] Basic blockchain operations (#454)
  [BlockClient] Refactor BlockClient to fetch latest block at init (#444)
  [Tooling] Add claim/proof/settlement dashboard & link (#479)
  [Config] feat: Simplify relay miner config (#477)
@bryanchriswhite's comment was marked as outdated.

@Olshansk (Member) commented Apr 29, 2024

@ezeike @bryanchriswhite Could one of you please tend to the remaining TODO_IN_THIS_ comments?

[Screenshot: 2024-04-29 at 10:55 AM]

@ezeike force-pushed the zero-retrylimit-replayclient branch from e375bf8 to 4bf968a on May 2, 2024 18:37
@Olshansk (Member) left a comment


Approving now that I see E2E tests passing in 0693ca1

@ezeike merged commit c6cb4d0 into main on May 8, 2024 - 9 checks passed
@bryanchriswhite deleted the zero-retrylimit-replayclient branch on May 8, 2024 09:25
bryanchriswhite added a commit that referenced this pull request May 8, 2024
* pokt/main:
  [Code Health] chore: cleanup localnet testutils (#515)
  Zero retryLimit Support in ReplayClient (#442)
  [LocalNet] Add infrastructure to run LLM inference (#508)
  [LocalNet] Documentation for MVT/LocalNet (#488)
  [GATEWAY] Makefile target added to send relays to grove gateway (#487)
  Update README
  [CI] Add GATEWAY_URL envar for e2e tests (#506)
  [Tooling] Add gateway stake/unstake/ logs (#503)
bryanchriswhite added a commit that referenced this pull request May 10, 2024
…cept

* pokt/main:
  Reply to red0ne's comments's in #501
  [Docs] Added QuickStart video (#522)
  [Genesis Config] Set MaxValidators to 1 (#520)
  [Docs] Update `Quickstart` guide to onboard new devs (#501)
  [Code Health] chore: cleanup localnet testutils (#515)
  Zero retryLimit Support in ReplayClient (#442)
@bryanchriswhite removed the push-image (CI related - pushes images to ghcr.io) and devnet-test-e2e labels on May 16, 2024
@github-actions bot removed the devnet label on May 16, 2024
Labels
enhancement (New feature or request), off-chain (Off-chain business logic), sdk (Everything related to the sdk)

Projects
Status: ✅ Done

Development
Successfully merging this pull request may close these issues: none yet.

5 participants