Athena query timeout - 504 #319

katebrenner · 2024-03-22T13:49:18Z

What happened:
Users report that Athena dashboards load slowly the first time they are opened: "After 5-10 minutes of manual reloads and several 504 Gateway Timeout errors, we finally get all our dashboards working fine for the rest of the day." (grafana/grafana#71946 (comment)). See also #99 (comment).

What you expected to happen:
Not this........

dcram · 2024-03-22T15:57:05Z

Thank you for reporting this here @katebrenner.

We are still experiencing the issue.

iwysiu · 2024-05-23T16:08:13Z

Hi @dcram ! I investigated this, and it seems like a lot of this is related to Athena behavior. The “HTTP 504 Gateway Timeout” comes from AWS's load balancing, and I found these docs from AMG about resolving it: https://repost.aws/knowledge-center/grafana-504-timeout-vpc. There is also information about how to tune Athena data and queries to improve the response time: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html. My understanding of the docs is that when the queries are initially run, Athena needs to assign resources, which is why they’re slow for the first query but improve afterwards.
I can look into retrying on the Gateway Timeout, but that won’t fix the underlying issue of the queries initially taking a long time.
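
For illustration only, here is a minimal sketch of what retrying on a Gateway Timeout could look like. It is not the plugin's implementation: the `send` callable, the attempt limit, and the exponential backoff delays are all assumptions made for this example. As noted above, a retry like this only hides the slow first query; it does not make it faster.

```python
import time
from typing import Callable, Tuple

GATEWAY_TIMEOUT = 504


def retry_on_gateway_timeout(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a status other than 504 or attempts run out.

    send is a hypothetical callable that performs one request and returns
    (status_code, body). Delays double after each 504: 1s, 2s, 4s, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != GATEWAY_TIMEOUT:
            break  # success or a different error: retrying will not help
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; any other status code, including other 5xx errors, is returned to the caller unchanged.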

aligthart · 2024-07-15T08:14:07Z

We are experiencing something similar, but we are not sure whether the links mentioned above describe our issue.
For us it depends on which authentication mechanism we use.

Our setup:
Grafana is deployed as part of the Prometheus stack on a Kubernetes cluster that is set up with kops. The worker nodes have an IAM role with a policy that allows access to Athena/S3 in another AWS account.

Below is the part of our Helm chart that configures the above:

grafana:
  plugins:
    - grafana-athena-datasource
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Athena
          type: grafana-athena-datasource
          jsonData:
            authType: ec2_iam_role
            assumeRoleArn: arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
            defaultRegion: eu-west-1
            catalog: AwsDataCatalog
            database: 'ourdatabase'
            workgroup: 'primary'
            outputLocation: s3://aws-athena-query-results-xxxxxxxxxxxxxxx-eu-west-1/ourlocation
  "grafana.ini":
    aws:
      allowed_auth_providers: default,keys,credentials,ec2_iam_role

With the above setup we also experience these gateway timeouts, which magically disappear after a few minutes. We are not sure what makes it work.
When we change the authentication provider from "ec2_iam_role" to "keys", everything works fine instantly. The datasource immediately starts returning results from Athena in the other AWS account.

When creating a datasource manually in Grafana, it always works with (access and secret) keys. But when manually creating a datasource that uses the workspace IAM role (ec2_iam_role provider), it is impossible to get it working. It looks like this timeout issue is worsened by the order in which values are entered.

So the IAM role only works when the datasource is created via automation, and even then with this timeout issue.
There are no issues at all when using the keys provider.

Also it does not matter if we fill in the "Assume Role ARN" field.

When using the keys provider and assuming the role, we see Athena in the other AWS account and never get a timeout.
When using the keys provider and not assuming the role, we see Athena in the local AWS account and never get a timeout.
When using the IAM role provider, we see Athena in the other AWS account (once we get past the timeouts).

Once the datasource is working, we have never experienced issues with actual queries.
Note that we are still trying out a first Athena setup. We are not sure if/when cached datasource connections will expire.

We have also tried this with different versions of the Prometheus stack, and thus with Grafana 9, 10 and 11.
Although the Grafana UI experience differs, the timeouts occur in all versions.

sarahzinger · 2024-07-15T19:30:01Z

@aligthart it's very interesting that you get a 500 when using ec2_iam_role. Do you hit the same issues if you use default as your auth provider? I believe default should also pick up on credentials that are on an ec2 instance.

aligthart · 2024-07-15T19:46:47Z

I started with the default provider, but I ran into the same problems.
For that reason I explicitly enabled the ec2_iam_role provider and started selecting the providers explicitly (keys and ec2_iam_role) to have better control over which one was used.

Though it is not in my Helm config above, I also enabled debug mode for the logging.
I also had a look at the plugin's code on GitHub.

I noticed some strange code in the plugin when using ec2_iam_role, where it first does an auth request to a hardcoded US region and only later does the real auth request. Not the expert here though....

And there are no access and secret keys on our Kubernetes worker nodes, so the "keys" auth provider would never work there. I only used it for testing/debugging purposes. Ultimately we are only interested in a setup with a working IAM role that assumes a role ARN pointing to another account.

sarahzinger · 2024-07-16T18:50:26Z

@aligthart I tried spinning up an ec2 instance with grafana 9 and athena 2.17.1, enabling both default and ec2_iam_role, and both auth methods worked for me. So I'm not sure what to make of this.

Do you have more information about when you see these 500s?
Do they happen for you when you load the datasource configuration page or when you save the datasource configuration details?
Do they happen on the Explore page or in dashboards?
Did you look into the vpc help page that @iwysiu linked?

aligthart · 2024-07-19T13:57:30Z

Hi,

Sorry for not being very responsive....

The errors appear on the Grafana page where you manually create a datasource, when using ec2_iam_role.
I can configure the assume role and default region.
But as soon as I try to enter any of the Athena details (data source, database, workgroup), things go wrong. The UI hangs and shows an exception dialog after 1 minute: Gateway Timeout.

This does not happen when using "keys". Then all works fine.

To work around this manual datasource creation, I automated the config in the grafana.ini.

Then the datasource is properly created (with all the settings I want) and eventually (not sure why not instantly) it starts working.

Yes, I did look at the VPC help page.
But I do not want to go there right now.
I don't think that is the solution for our problem (I do not see how that wiki would explain different behavior based on credential provider).

Maybe one more thing to add on our setup.
Our Grafana sits behind an nginx ingress controller which we expose via an AWS NLB.

So this is also in our grafana.ini

  grafana.ini: |
    [server]
    domain = our.domain.com
    root_url = https://our.domain.com/grafana

In developer tools I see the grafana datasource doing calls to this endpoint.

iwysiu · 2024-07-30T21:51:20Z

Hi @aligthart ! Based on the fact that our default session duration is 15m, and it sounds like you’re able to connect for a full day after it errors in the morning, I’m not sure the issue is in the datasource plugin. We should be expiring sessions every 15 minutes, and they’ll attempt to connect with the same settings every time, so I would expect it to fail every 15m instead of every morning if it was a datasource plugin problem.

Both our prometheus and Athena datasource plugins use the same authentication code, so we can’t use that to determine anything, but we may be able to use the AWS CLI to test where the error is coming from. Can you try configuring the AWS CLI with your IAM role and running the command aws athena list-data-catalogs? If that gets a timeout, then we know the issue is coming from AWS.
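
If scripting the check is easier than using the CLI, the same test can be sketched in Python. This is only an illustration: `probe` and `list_athena_catalogs` are names made up for this example, boto3 is assumed to be installed on a host that has the IAM role, and the `eu-west-1` default is taken from the Helm values earlier in the thread. `list_data_catalogs` is the boto3 counterpart of `aws athena list-data-catalogs`.

```python
import time
from typing import Any, Callable, NamedTuple, Optional


class ProbeResult(NamedTuple):
    ok: bool
    seconds: float
    error: Optional[str]


def probe(call: Callable[[], Any],
          clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run call() once and report whether it succeeded and how long it took."""
    start = clock()
    try:
        call()
    except Exception as exc:  # report any failure instead of crashing the probe
        return ProbeResult(False, clock() - start, f"{type(exc).__name__}: {exc}")
    return ProbeResult(True, clock() - start, None)


def list_athena_catalogs(region: str = "eu-west-1") -> Any:
    """Hypothetical wrapper around the Athena ListDataCatalogs API call."""
    import boto3  # imported lazily so probe() works without boto3 installed
    return boto3.client("athena", region_name=region).list_data_catalogs()
```

Running `print(probe(list_athena_catalogs))` from a pod or node that uses the IAM role shows whether the call itself is slow or failing outside Grafana. Note that this sketch uses the default credential chain and does not assume the cross-account role; that would need to be configured separately.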

If that doesn’t error, getting the Grafana logs may give us a better idea of what’s happening. If you configure grafana with log level debug and get the logs from the time of the timeouts that could help us reproduce/debug this.

katebrenner added the type/bug and datasource/Athena labels Mar 22, 2024

aws-ds-token-creator bot added this to AWS Datasources Mar 22, 2024

github-project-automation bot moved this to Incoming in AWS Datasources Mar 22, 2024

This was referenced Mar 22, 2024

Getting 504 Gateway Timeout after 60 seconds of query runtime #99

Closed

Datasources: dataproxy.timeout seems to be ingored grafana/grafana#71946

Closed

kevinwcyu moved this from Incoming to Next in AWS Datasources Mar 22, 2024

iwysiu mentioned this issue Apr 30, 2024

Query are taking too much time and not giving the output result in grafana #322

Closed

iwysiu self-assigned this May 23, 2024

iwysiu moved this from Next to Waiting in AWS Datasources May 23, 2024

sarahzinger moved this from Waiting to Incoming in AWS Datasources Jul 15, 2024

sarahzinger moved this from Incoming to Waiting in AWS Datasources Jul 15, 2024

sarahzinger moved this from Waiting to Incoming in AWS Datasources Jul 16, 2024

sarahzinger moved this from Incoming to Waiting in AWS Datasources Jul 16, 2024

iwysiu moved this from Waiting to Incoming in AWS Datasources Jul 29, 2024

iwysiu moved this from Incoming to Waiting in AWS Datasources Jul 30, 2024
