
Intermittent github check "Google Cloud Build / taskcluster (taskcluster-dev)" #6854

Open
petemoore opened this issue Feb 22, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@petemoore
Member

petemoore commented Feb 22, 2024

Describe the bug
Getting the following failure in github check "Google Cloud Build / taskcluster (taskcluster-dev)" intermittently:

Step #9 - "Smoketest": [08:29:24] __version__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __version__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:24] __lbheartbeat__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __lbheartbeat__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:24] __heartbeat__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __heartbeat__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:25] Ping health endpoint for github: failed
Step #9 - "Smoketest": [08:29:25] Ping health endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)

See e.g. https://github.com/taskcluster/taskcluster/runs/21852513428, a github check that ran for a commit on the main branch.

@petemoore petemoore added the bug Something isn't working label Feb 22, 2024
@petemoore petemoore changed the title Intermittent Intermittent github check "Google Cloud Build / taskcluster (taskcluster-dev)" Feb 22, 2024
@lotas
Contributor

lotas commented Feb 22, 2024

Thanks Pete!
I think we can add some timeouts or retries there; I'm not sure why it took longer for the github service to become ready.

@petemoore
Member Author

Looks like this is triggered from cloudbuild.yaml, which runs

corepack enable && yarn && yarn smoketest

So yarn smoketest seems to be the culprit.

@petemoore
Member Author

petemoore commented Feb 22, 2024

Looks like there are no retries here:

const resp = await got.get(healthcheck);

const resp = await got.get(healthcheck);

const resp = await got(dunderVersion, { throwHttpErrors: true });

Not sure if there are other places with the same problem.

@lotas Do we have prior art for wrapping got HTTP requests with exponential backoff?
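
For what it's worth, if I'm reading got's docs right, it already ships a built-in retry option that backs off exponentially and retries 503 responses for GET requests by default, so a separate wrapper might not even be needed. A minimal sketch, reusing the healthcheck variable from the snippets above (the limit of 5 is an arbitrary choice, not a project convention):

// Sketch only: got's built-in retry backs off exponentially and retries
// 503 responses for idempotent requests by default; the limit is arbitrary.
const resp = await got.get(healthcheck, { retry: { limit: 5 } });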

@petemoore
Member Author

I see we have our own retry function in the Node.js taskcluster client, clients/client/src/retry.js, which we could probably pull out into its own module.

Alternatively, ChatGPT suggests axios. Any thoughts/preferences?
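
For illustration, here is a minimal sketch of what a standalone exponential-backoff wrapper could look like, loosely in the spirit of clients/client/src/retry.js. The helper name, option names, and defaults below are made up for this sketch and not taken from that file:

// Hypothetical helper: retry an async function with exponential backoff.
// Names and defaults here are illustrative only.
const withRetries = async (fn, { retries = 5, delayFactor = 100 } = {}) => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Stop once the retry budget is spent, or on non-5xx HTTP errors.
      const status = err.response && err.response.statusCode;
      if (attempt >= retries || (status && status < 500)) {
        throw err;
      }
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise(resolve => setTimeout(resolve, delayFactor * 2 ** attempt));
    }
  }
};

// Example usage against a placeholder URL:
// const resp = await withRetries(() => got.get('https://example.com/__heartbeat__'));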

@lotas
Contributor

lotas commented Feb 22, 2024

Thanks Pete,
I don't think there's an issue with this particular implementation of the smoketest, as it's used the same way on other environments during deployments and seems to run just fine.

There was probably some pre-existing condition on our dev cluster which delayed the deployment or rollout of newer versions.
I'll have a look; maybe just a timeout before running the smoketest will do the trick.

@lotas
Contributor

lotas commented Feb 29, 2024

To add to my last comment, I don't think timeouts would help. If the test fails, it means the deployment went wrong; that isn't intermittent, but rather points to a problem that happened during that particular deployment.
Each such failure should be investigated individually to see what caused it. Increasing timeouts or adding backoffs would only delay the failure in such cases.
