
Zero retryLimit Support in ReplayClient #442

Merged
@ezeike merged 32 commits into main from zero-retrylimit-replayclient on May 8, 2024

Conversation

@ezeike (Contributor) commented Mar 19, 2024

Summary

Enables the replay client to continuously attempt to reconnect to the pocket node.

Issue

The current implementation of the replay client panics after 10 failed connection attempts.
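
For illustration only, a minimal sketch of the zero-means-unlimited convention the title refers to; the names below are hypothetical and not taken from the actual diff:

package retry

// shouldGiveUp is a hypothetical sketch of the zero-means-unlimited reading of
// the PR title (illustrative names, not the actual diff): a retryLimit of 0
// disables the limit, so the replay client keeps trying to reconnect to the
// pocket node instead of panicking after a fixed number of attempts.
func shouldGiveUp(retryCount, retryLimit uint64) bool {
	return retryLimit > 0 && retryCount >= retryLimit
}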

Type of change

Select one or more:

  • New feature, functionality or library
  • Bug fix
  • Code health or cleanup
  • Documentation
  • Other (specify)

Testing

  • Unit Tests: make go_develop_and_test
  • LocalNet E2E Tests: make test_e2e
  • DevNet E2E Tests: Add the devnet-test-e2e label to the PR. THIS IS VERY EXPENSIVE, so only do it after all the reviews are complete.
  • Documentation changes: make docusaurus_start

Sanity Checklist

  • I have tested my changes using the available tooling
  • I have commented my code
  • I have performed a self-review of my own code; both comments & source code
  • I have created and referenced any new tickets, if applicable
  • I have left TODOs throughout the codebase, if applicable

@ezeike added the enhancement (New feature or request) label on Mar 19, 2024
@ezeike added this to the Shannon TestNet milestone on Mar 19, 2024
@ezeike requested a review from red-0ne on March 19, 2024 19:10
@ezeike self-assigned this on Mar 19, 2024
@ezeike force-pushed the zero-retrylimit-replayclient branch from 077fe70 to 7fd7a33 on March 19, 2024 19:12
@h5law (Contributor) left a comment


LGTM! Just want you to add a TODO, then good to go!

Review thread on pkg/client/events/replay_client.go (outdated, resolved)
@h5law (Contributor) previously requested changes on Mar 19, 2024


After a second look at

if publishError != nil {
you need to change the logic in the retry.OnError method as well as this, or you will panic if the connection ever closes and needs to be reestablished.

@Olshansk requested review from bryanchriswhite and removed the request for red-0ne on March 20, 2024 03:06
@Olshansk (Member) left a comment


@bryanchriswhite Added you as a reviewer instead of @red-0ne since you have more context on the replay client and I wanted to get your 👀 on this.

Review thread on pkg/retry/retry.go (outdated, resolved)
@ezeike force-pushed the zero-retrylimit-replayclient branch from 3ec9494 to 126353d on March 20, 2024 16:14
@ezeike (Contributor, Author) commented Mar 20, 2024

Some questions before more changes are made:

  • Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?
  • Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

@Olshansk (Member) commented

@bryanchriswhite Can you PTAL at @ezeike's questions before you go OOO?

@h5law (Contributor) commented Mar 21, 2024

@ezeike

Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?

The errCh in retry.OnError closes after we have reached the max number of retry attempts. retry.OnError essentially takes a work function that returns an error channel and runs it until it errors, then retries a certain number of times. See [1] for the work function that is supplied by the replay client. The two instances where it returns an error on the channel (causing a retry in retry.OnError) are:

  1. We were unable to subscribe to the events query client for some reason.
  2. The events query client closed for some reason.

These may be due to network issues or some other unexpected error, but they are natural possibilities. If either of these does happen we are already treating it as an error, and using a channel makes sense in order for the retry logic to work.

I am a little unsure what you mean by retry in the same way but treat them as errors - it sounds like that would just be the same thing to me?

Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

The retryLimit is essentially there to stop an infinite loop from occurring if, say, the connection to the full node providing the on-chain data to a gateway is faulty or the node goes offline.

By setting the eventsBytesRetryLimit to ^uint64(0) and removing the if retryLimit > 0 check you will get the same result - (essentially) infinite retries. As soon as you have had a successful connection to an events query client for 10 seconds, your retryCount resets, so you can again fail and retry to connect over and over.

@ezeike Are you finding that your connections are failing a lot and you are reaching the 10 retry limit often? If so I think investigating what is causing that could help us figure out a better solution than an infinite loop.

[1]

func (rClient *replayClient[T]) retryPublishEventsFactory(ctx context.Context) func() chan error {
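
To make the mechanism above concrete, here is a self-contained, hypothetical sketch of an OnError-style helper under the same assumptions (a work function returning an error channel, a delay between attempts, a count that resets after a sustained healthy connection, and zero meaning unlimited). It is illustrative only and not the actual pkg/retry implementation:

package retry

import (
	"context"
	"errors"
	"time"
)

// onErrorSketch retries workFn until retryLimit consecutive failures occur.
// A received error and a closed error channel are both treated as failures.
func onErrorSketch(
	ctx context.Context,
	retryLimit uint64,
	retryDelay time.Duration,
	retryResetTimeout time.Duration,
	workFn func() chan error,
) error {
	var retryCount uint64
	for {
		errCh := workFn()
		startedAt := time.Now()

		select {
		case <-ctx.Done():
			return ctx.Err()
		case workErr, open := <-errCh:
			// If the connection stayed healthy long enough, start counting again.
			if time.Since(startedAt) > retryResetTimeout {
				retryCount = 0
			}
			retryCount++
			// A retryLimit of 0 means never give up.
			if retryLimit > 0 && retryCount >= retryLimit {
				if !open {
					return errors.New("work error channel closed: retry limit reached")
				}
				return workErr
			}
			time.Sleep(retryDelay)
		}
	}
}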

@ezeike (Contributor, Author) commented Mar 21, 2024

@ezeike

Under what conditions is errCh (in OnError) closed? If that does happen, can we count that as an error and retry the same as if it was an error that came through errCh?

The errCh in retry.OnError closes after we have reached the max number of retry attempts. retry.OnError essentially takes a work function that returns an error channel and runs it until it errors, then retries a certain number of times. See [1] for the work function that is supplied by the replay client. The two instances where it returns an error on the channel (causing a retry in retry.OnError) are:

  1. We were unable to subscribe to the events query client for some reason.
  2. The events query client closed for some reason.

These may be due to network issues or some other unexpected error, but they are natural possibilities. If either of these does happen we are already treating it as an error, and using a channel makes sense in order for the retry logic to work.

I am a little unsure what you mean by retry in the same way but treat them as errors - it sounds like that would just be the same thing to me?

Errors are received through the error channel and those trigger a Sleep and a retry. However, when the error channel is closed, OnError returns and the SDK panics. A first thought was just to eliminate that return so execution continues on to the same Sleep-and-retry logic; workFn() is called every loop and errCh is created anew anyway.

The current design is that the SDK takes care of connection details and does its best to ensure there's always a connection to a node. We should keep this as much as possible so that using the SDK stays as simple as we can make it.

Either:

  • The SDK is modified to remove the panic and return an error to the gateway when the retry limit is reached. Then the gateway can make a new SDK and try again.
  • The SDK is modified to allow unlimited retries.

At the moment we should pick the quickest one that solves the issue so we can get our gateway on testnet as soon as possible. The second option was quicker so I went with that.

There are other discussions to have around who reconnects, how to support using multiple nodes, and how configurable the SDK should be, but for now we want an MVP ASAP.

Do we need a retryLimit at all? At the moment it does not appear we do, is there a future scenario where we might?

The retryLimit is essentially there to stop an infinite loop from occurring if, say, the connection to the full node providing the on-chain data to a gateway is faulty or the node goes offline.

From a gateway perspective we want an infinite loop. The SDK only allows specifying a single node, and it should do its best to always be connected to that node.

The infinite loop doesn't have to exist in OnError, though; we could do it in goPublishEvents in this case.
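
For illustration, a minimal sketch of that option, with a caller-supplied subscribe function standing in for the replay client's internals (hypothetical names, not the actual goPublishEvents):

package events

import (
	"context"
	"log"
	"time"
)

// publishEventsForever re-subscribes indefinitely: every error (or closed
// channel) from the subscription is logged and followed by a short delay
// before trying again, until the context is canceled.
func publishEventsForever(
	ctx context.Context,
	reconnectDelay time.Duration,
	subscribe func(context.Context) chan error,
) {
	for {
		errCh := subscribe(ctx)
		select {
		case <-ctx.Done():
			return
		case subErr := <-errCh:
			// subErr is nil if the channel was simply closed; either way, reconnect.
			log.Printf("events subscription ended (err=%v); reconnecting", subErr)
		}
		time.Sleep(reconnectDelay)
	}
}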

If we don't have an infinite loop the gateway will have to handle connection failures and re-create the SDK. That's reasonable too, but shifts some work to gateway devs.

By setting the eventsBytesRetryLimit to ^uint64(0) and removing the if retryLimit > 0 check you will get the same result - (essentially) infinite retries. As soon as you have had a successful connection to an events query client for 10 seconds, your retryCount resets, so you can again fail and retry to connect over and over.

If we do this, retryCount >= retryLimit will be true on the first pass and it will return. We'd need another tweak.

@ezeike Are you finding that your connections are failing a lot and you are reaching the 10 retry limit often? If so I think investigating what is causing that could help us figure out a better solution than an infinite loop.

I saw the error when I didn't have localnet running. Moving forward, though, we can't have the SDK panic when a node connection is lost or not available.

@h5law (Contributor) commented Mar 21, 2024

I'll try to keep this one shorter, @ezeike.

Errors are received through the error channel and those trigger a Sleep and a retry. However, when the error channel is closed, OnError returns and the SDK panics. A first thought was just to eliminate that return so execution continues on to the same Sleep-and-retry logic; workFn() is called every loop and errCh is created anew anyway.
The current design is that the SDK takes care of connection details and does its best to ensure there's always a connection to a node. We should keep this as much as possible so that using the SDK stays as simple as we can make it.
Either:
The SDK is modified to remove the panic and return an error to the gateway when the retry limit is reached. Then the gateway can make a new SDK and try again.
The SDK is modified to allow unlimited retries.
At the moment we should pick the quickest one that solves the issue so we can get our gateway on testnet as soon as possible. The second option was quicker so I went with that.
There are other discussions to have around who reconnects, how to support using multiple nodes, and how configurable the SDK should be, but for now we want an MVP ASAP.

Thoughts about adding this

defer func() {
	// recover() only has an effect inside a deferred function and returns an
	// interface value rather than an error, hence fmt.Sprintf below (requires
	// the "fmt" and "strings" imports).
	if r := recover(); r != nil {
		errMsg := fmt.Sprintf("%v", r)
		// You could be extra verbose by checking for more parts of the panic's error
		// message by including the retry limit and error from the retry.OnError, but
		// this is the error message given to the panic we are talking about.
		// See: ./pkg/client/events/replay_client.go#L199
		if strings.Contains(errMsg, ".goPublishEvents should never reach this spot:") {
			restartSDK() // or however you do this; I am unfamiliar but the rest is correct
		} else {
			panic(r)
		}
	}
}()

This means we can still have the limit - it fixes your problem of needing an infinite loop and is a very minimal code change, quite simple in fact. Placing this somewhere appropriate in the SDK will be pretty easy IMO; @red-0ne and you @ezeike have more context here than I do.

If we do this, retryCount >= retryLimit will be true on the first pass and it will return. We'd need another tweak.

If we don't go with the above solution, we will count down instead of counting up, if that makes sense: since we will use the max uint64 value, counting down is the only real option - again, minimal changes.
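
A hypothetical sketch of that count-down variant, illustrative rather than the actual pkg/retry code:

package retry

import (
	"errors"
	"time"
)

// onErrorCountdown starts from max uint64 (effectively unlimited retries) and
// decrements on every failed attempt instead of counting up toward a limit.
func onErrorCountdown(retryDelay time.Duration, workFn func() chan error) error {
	retriesLeft := ^uint64(0) // max uint64
	for {
		workErr, open := <-workFn() // an error or a closed channel both count as a failure
		if retriesLeft == 0 {
			if !open {
				return errors.New("retry limit reached: work error channel closed")
			}
			return workErr
		}
		retriesLeft--
		time.Sleep(retryDelay)
	}
}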

@ezeike force-pushed the zero-retrylimit-replayclient branch 5 times, most recently from f57bd58 to 99cfe9a, on March 27, 2024 14:43
@ezeike (Contributor, Author) commented Mar 27, 2024

IMO the proper approach here is to refactor the replay client to remove the panics that are present in the package.

As a consequence there is now no way to limit the number of connection attempts the replay client makes. We need some feedback mechanism before we allow limiting reconnection attempts.

Ideally we would give the replay client an error channel to signal to the caller that it is giving up, but that is a larger PR and would require more complexity in our gateway.

We (the backend team) have decided we'd like to remove the panics and move forward with testing as-is. We can re-evaluate the error channel at a later date if the need becomes apparent.

If we were to implement an error channel (and a config option for retryLimit), we'd need to modify these three clients:

BlockClient, DelegationClient, TxClient

as well as their associated supplier functions, and both gateways (appgate and grove gateway) to handle the config and extra state management.
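
As a rough, hypothetical illustration of the deferred error-channel idea (names invented for the sketch; nothing here is implemented in this PR), the replay client could expose a channel that the gateway watches instead of relying on a panic:

package events

// failureSignaler is a hypothetical sketch of the deferred design: the client
// reports a terminal reconnection failure to its caller instead of panicking.
type failureSignaler struct {
	// Failed receives at most one error, signalling that the client gave up.
	Failed chan error
}

func newFailureSignaler() *failureSignaler {
	return &failureSignaler{Failed: make(chan error, 1)}
}

// giveUp is called by the client when the (hypothetical) retry limit is hit.
func (s *failureSignaler) giveUp(err error) {
	select {
	case s.Failed <- err:
	default: // a failure was already reported
	}
}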

@ezeike requested review from h5law and Olshansk on March 27, 2024 15:17
@ezeike force-pushed the zero-retrylimit-replayclient branch from 99cfe9a to 396f4c3 on March 29, 2024 14:19
Review threads on pkg/client/events/replay_client.go (3 outdated, resolved)
@bryanchriswhite force-pushed the zero-retrylimit-replayclient branch 2 times, most recently from 5b99b80 to 0d25513, on April 3, 2024 10:15
@Olshansk added the off-chain (Off-chain business logic) label on Apr 5, 2024
Review thread on pkg/retry/retry_test.go (outdated, resolved)
@Olshansk (Member) left a comment


@ezeike

  1. Please merge with main
  2. See bryan's comment
  3. Make sure E2E tests pass on DevNet

Will approve after ☝️

@ezeike force-pushed the zero-retrylimit-replayclient branch from 8b008c7 to 6680149 on April 25, 2024 21:13
bryanchriswhite and others added 5 commits April 26, 2024 09:08
…lient

* pokt/main:
  [LocalNet][Hotfix] Turn off pproff by default (#500)
  [Tooling] Add pprof endpoints and documentation (#484)
  [Documentation] Basic blockchain operations (#454)
  [BlockClient] Refactor BlockClient to fetch latest block at init (#444)
  [Tooling] Add claim/proof/settlement dashboard & link (#479)
  [Config] feat: Simplify relay miner config (#477)
@bryanchriswhite's comment was marked as outdated.

@Olshansk (Member) commented Apr 29, 2024

@ezeike @bryanchriswhite Could one of you please tend to the remaining TODO_IN_THIS_ comments?

[Screenshot: 2024-04-29 at 10:55 AM]

@ezeike force-pushed the zero-retrylimit-replayclient branch from e375bf8 to 4bf968a on May 2, 2024 18:37
@Olshansk (Member) left a comment


Approving now that I see E2E tests passing in 0693ca1

@ezeike merged commit c6cb4d0 into main on May 8, 2024 - 9 checks passed
@bryanchriswhite deleted the zero-retrylimit-replayclient branch on May 8, 2024 09:25
bryanchriswhite added a commit that referenced this pull request May 8, 2024
* pokt/main:
  [Code Health] chore: cleanup localnet testutils (#515)
  Zero retryLimit Support in ReplayClient (#442)
  [LocalNet] Add infrastructure to run LLM inference (#508)
  [LocalNet] Documentation for MVT/LocalNet (#488)
  [GATEWAY] Makefile target added to send relays to grove gateway (#487)
  Update README
  [CI] Add GATEWAY_URL envar for e2e tests (#506)
  [Tooling] Add gateway stake/unstake/ logs (#503)
bryanchriswhite added a commit that referenced this pull request May 10, 2024
…cept

* pokt/main:
  Reply to red0ne's comments's in #501
  [Docs] Added QuickStart video (#522)
  [Genesis Config] Set MaxValidators to 1 (#520)
  [Docs] Update `Quickstart` guide to onboard new devs (#501)
  [Code Health] chore: cleanup localnet testutils (#515)
  Zero retryLimit Support in ReplayClient (#442)
@bryanchriswhite removed the push-image (CI related - pushes images to ghcr.io) and devnet-test-e2e labels on May 16, 2024
@github-actions bot removed the devnet label on May 16, 2024
Labels
enhancement (New feature or request), off-chain (Off-chain business logic), sdk (Everything related to the sdk)

Projects
Status: ✅ Done

Development
Successfully merging this pull request may close these issues: none yet.

5 participants