Athena query timeout - 504 #319
Thank you for reporting this here @katebrenner. We are still experiencing the issue.
Hi @dcram ! I investigated this, and it seems like a lot of this is related to Athena behavior. The “HTTP 504 Gateway Timeout” comes from AWS's load balancing, and I found these docs from AMG about resolving it: https://repost.aws/knowledge-center/grafana-504-timeout-vpc. There is also information about how to tune Athena data and queries to improve response time: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html. My understanding of the docs is that when queries are initially run, Athena needs to assign resources, which is why they’re slow for the first query but improve afterwards.
We experience something similar, but we're not sure the links mentioned above describe our issue. Our setup: part of the helm chart we use to configure this is below.
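The helm values themselves are not reproduced in this thread; a minimal sketch of that kind of setup, assuming the kube-prometheus-stack chart's `grafana.additionalDataSources` provisioning and using placeholder region, workgroup, and role values, might look roughly like this:

```yaml
# Hypothetical sketch only: field names follow the Athena datasource provisioning
# conventions as I understand them; the account ID, role ARN, region, database and
# workgroup are placeholders, not the reporter's actual values.
grafana:
  additionalDataSources:
    - name: Athena
      type: grafana-athena-datasource
      jsonData:
        authType: ec2_iam_role            # use the workspace/EC2 IAM role instead of keys
        assumeRoleArn: arn:aws:iam::123456789012:role/athena-query-role
        defaultRegion: eu-west-1
        catalog: AwsDataCatalog
        database: example_db
        workgroup: primary
```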
When creating a datasource manually in Grafana, it always works with (access and secret) keys. But when manually creating a datasource using the workspace IAM role (ec2_iam_role provider), it is impossible to get it working. It looks like this timeout issue is worsened by the order in which values are entered. Only when creating the datasource via automation does the ec2_iam_role provider work, and even then with this timeout issue. It also does not matter whether we fill in the "Assume Role ARN" field.
And once the data provider is working, we have never experienced issues with actual queries. We have also tried this with different versions of the prometheus stack, and thus with Grafana 9, 10 and 11.
@aligthart it's very interesting that you get a 500 when using the ec2_iam_role provider.
I started with the default provider, but then I ran into the same problems. Though it's not in my helm config above, I also enabled debug mode for the logging. I noticed some strange code in the plugin when using ec2_iam_role, where it first does an auth request to a hardcoded US region and only later does the real auth request. I'm not the expert here though.... And there are no access and secret keys on our kubernetes worker nodes, so the "keys" auth provider would never work for us; I only used that for testing/debugging purposes. Eventually we are only interested in a setup with a working IAM role while assuming an ARN pointing to another account.
@aligthart I tried spinning up an EC2 instance with Grafana 9 and Athena plugin 2.17.1, enabling both default and ec2_iam_role, and both auth methods worked for me. So I'm not sure what to make of this. Do you have more information about when you see these 500s?
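For anyone wanting to repeat that test, a rough sketch of such a setup on an EC2 instance (assuming the Grafana package repository is already configured and the instance profile grants Athena access) could be:

```bash
# Sketch of the test environment described above; package source and version are assumptions.
sudo yum install -y grafana
sudo grafana-cli plugins install grafana-athena-datasource 2.17.1

# Allow both auth methods in /etc/grafana/grafana.ini before starting:
#   [aws]
#   allowed_auth_providers = default,keys,credentials,ec2_iam_role

sudo systemctl restart grafana-server
```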
Hi, sorry for not being very responsive.... The errors appear on the Grafana page where you manually create a datasource. This does not happen when using "keys"; then all works fine. To work around the manual datasource creation, I automated the config in the grafana.ini. Then the datasource is properly created (with all the settings I want) and eventually (not sure why not instantly) it starts working. Yes, I did look at the VPC help page. One more thing to add about our setup: this is also in our grafana.ini (see the sketch below).
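The grafana.ini fragment is not preserved in the thread; assuming the standard `[aws]` and `[log]` sections, the relevant settings might look roughly like this:

```ini
; Hypothetical sketch only: these are the standard Grafana [aws] and [log] options
; as I understand them, not the reporter's actual file.
[aws]
allowed_auth_providers = default,keys,credentials,ec2_iam_role
assume_role_enabled = true

[log]
level = debug
```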
In the developer tools I see the grafana datasource making calls to this endpoint.
Hi @aligthart ! Based on the fact that our default session duration is 15m, and it sounds like you're able to connect for a full day after it errors in the morning, I'm not sure the issue is in the datasource plugin. We should be expiring sessions every 15 minutes, and they'll attempt to connect with the same settings every time, so I would expect it to fail every 15m instead of every morning if it were a datasource plugin problem.

Both our Prometheus and Athena datasource plugins use the same authentication code, so we can't use that to determine anything, but we may be able to use the AWS CLI to test where the error is coming from. Can you try configuring the AWS CLI with your IAM role and running a test command (a sketch follows below)? If that doesn't error, getting the Grafana logs may give us a better idea of what's happening. If you configure Grafana with log level debug and get the logs from the time of the timeouts, that could help us reproduce/debug this.
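The exact command is not preserved above; one plausible set of checks, assuming the AWS CLI v2 and the (placeholder) role ARN from the datasource config, would be something like:

```bash
# Hypothetical checks: verify which identity the CLI is using, whether assuming the
# cross-account role succeeds, and whether the Athena API is reachable at all.
aws sts get-caller-identity

aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/athena-query-role \
  --role-session-name grafana-debug          # does the assume-role call itself succeed?

aws athena list-work-groups --region eu-west-1   # basic Athena API reachability
```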
What happened:
Users report slow Athena dashboard loading on the first load: "After 5-10 minutes of manual reloads and several 504 Gateway Timeout errors, we finally get all our dashboards working fine for the rest of the day." (grafana/grafana#71946 (comment) and #99 (comment))
What you expected to happen:
Not this: dashboards should load on the first attempt, without repeated 504 errors.