Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclearable locks with mergeability requirements after upgrade #5048

Open
cblkwell opened this issue Oct 29, 2024 · 13 comments
Open

Unclearable locks with mergeability requirements after upgrade #5048

cblkwell opened this issue Oct 29, 2024 · 13 comments
Labels
bug Something isn't working regression Bug introduced in a new version

Comments

@cblkwell
Copy link

cblkwell commented Oct 29, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Yesterday, we upgraded from v0.28.5 to v0.30.0. Afterwards, it appears Atlantis was not respecting the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY if the apply previously had failed for any reason. We had PRs where someone had tried to apply and it had failed, either because they didn't regenerate a plan after the upgrade (we are not using EFS) or because of some other reason. It then became impossible to apply the PR because the previous apply had failed, leaving the PR unmergeable.

We tried running atlantis unlock and destroying and recreating the ECS service that Atlantis was running under, and neither fixed the issue. Finally, we rolled back to v0.29.0, and this appears to have fixed the issue; PRs once again allowed us to plan and apply as normal.

Reproduction Steps

Running Atlantis v0.30.0 with branch protection requiring atlantis/apply to be successful in order to merge. Fail an apply, then attempt to replan and apply; you should get this error:

Screenshot 2024-10-29 at 9 47 02 AM

Logs

This appears to be the only log message applying to the issue:

Logs
{"date":1730209894.274917,"log":"{\"level\":\"error\",\"ts\":\"2024-10-29T13:51:34.274Z\",\"caller\":\"events/instrumented_project_command_runner.go:84\",\"msg\":\"Failure running apply operation: Pull request must be mergeable before running apply.\",\"json\":{\"repo\":\"REDACTED_REPO\",\"pull\":\"3866\"},\"stacktrace\":\"github.com/runatlantis/atlantis/server/events.RunAndEmitStats\\n\\tgithub.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:84\\ngithub.com/runatlantis/atlantis/server/events.(*InstrumentedProjectCommandRunner).Apply\\n\\tgithub.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:46\\ngithub.com/runatlantis/atlantis/server/events.runProjectCmds\\n\\tgithub.com/runatlantis/atlantis/server/events/project_command_pool_executor.go:48\\ngithub.com/runatlantis/atlantis/server/events.(*ApplyCommandRunner).Run\\n\\tgithub.com/runatlantis/atlantis/server/events/apply_command_runner.go:167\\ngithub.com/runatlantis/atlantis/server/events.(*DefaultCommandRunner).RunCommentCommand\\n\\tgithub.com/runatlantis/atlantis/server/events/command_runner.go:389\"}","container_id":"98c140658909440aab128a263758c770-3286682525","container_name":"atlantis","source":"stderr","ecs_cluster":"atlantis-aws","ecs_task_arn":"arn:aws:ecs:us-east-1:150523386789:task/atlantis-aws/98c140658909440aab128a263758c770","ecs_task_definition":"atlantis-aws:23"}
{"date":1730209895.469134,"container_id":"98c140658909440aab128a263758c770-3286682525","container_name":"atlantis","source":"stderr","log":"{\"level\":\"info\",\"ts\":\"2024-10-29T13:51:35.469Z\",\"caller\":\"events/automerger.go:20\",\"msg\":\"not automerging because project at dir \\\"REDACTED_PROJECT\\\", workspace \\\"default\\\" has status \\\"apply_errored\\\"\",\"json\":{\"repo\":\"REDACTED_REPO\",\"pull\":\"3866\"}}","ecs_cluster":"atlantis-aws","ecs_task_arn":"arn:aws:ecs:us-east-1:REDACTED_ACCT:task/atlantis-aws/98c140658909440aab128a263758c770","ecs_task_definition":"atlantis-aws:23"}

Environment details

If not already included, please provide the following:

  • Atlantis version: 0.30.0
  • Deployment method: terraform-aws-atlantis module using Fargate with no EFS partition
  • If not running the latest Atlantis version have you tried to reproduce this issue on the latest version:
  • Atlantis flags:
ATLANTIS_VCS_STATUS_NAME = "atlantis-prod"
ATLANTIS_HIDE_PREV_PLAN_COMMENTS = "true"
ATLANTIS_WRITE_GIT_CREDS = "true"
ATLANTIS_SILENCE_NO_PROJECTS = "true"
ATLANTIS_ALLOW_COMMANDS = "version,plan,apply,unlock,approve_policies,import,state"
ATLANTIS_AUTOMERGE = "true"
ATLANTIS_REPO_CONFIG_JSON = "{\"metrics\":{\"prometheus\":{\"endpoint\":\"/metrics\"}},\"repos\":[{\"allowed_custom_workflows\":true,\"allowed_overrides\":[\"apply_requirements\",\"workflow\"],\"apply_requirements\":[\"mergeable\"],\"delete_source_branch_on_merge\":true,\"id\":\"/.*/\",\"import_requirements\":[\"mergeable\"],\"repo_config_file\":\"atlantis-prod.yaml\"}]}"
ATLANTIS_EMOJI_REACTION = "eyes"
ATLANTIS_REPO_ALLOWLIST = REDACTED_REPO
ATLANTIS_EXECUTABLE_NAME = "atlantis-prod"
ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY = "true"
ATLANTIS_USE_TF_PLUGIN_CACHE = "false"
ATLANTIS_DEFAULT_TF_VERSION = "v1.9.8"
ATLANTIS_PORT = "4141"
ATLANTIS_ATLANTIS_URL = REDACTED_URL

Atlantis server-side config file:

repos:
- id: /.*/
  repo_config_file: atlantis-prod.yaml
  apply_requirements: [mergeable]
  import_requirements: [mergeable]
  allowed_overrides: [apply_requirements, workflow]
  allowed_custom_workflows: true
  delete_source_branch_on_merge: true
metrics:
  prometheus:
    endpoint: "/metrics"

Repo atlantis.yaml file:

This is just version: 3 followed by our list of projects.

@cblkwell cblkwell added the bug Something isn't working label Oct 29, 2024
Copy link

dosubot bot commented Oct 29, 2024

Hello, @cblkwell! I'm here to assist you with any bugs, questions, or contributions you may have. Let's work together to resolve your issue.

There are a couple of similar issues that might be relevant to your situation:

  1. Apply fails on Gitlab 16.4.0 with "Pull request must be mergeable before running apply." - This open issue discusses problems with mergeability checks in GitLab, which might have some parallels with your GitHub issue. A suggested workaround involves manually updating the commit status using the GitLab API, which might inspire a similar approach for GitHub [1].

  2. GitHub Mergability check should support ruleset bypasser status - This open issue addresses the need for Atlantis to support certain GitHub ruleset bypass statuses, which could be related to your problem with the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY flag not functioning as expected [2].

Regarding the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY flag, it is designed to bypass the mergeability check related to the atlantis/apply status. However, it seems that in version 0.30.0, this behavior might not be working as intended if a previous apply operation failed. This could be due to changes in how Atlantis handles mergeability checks or failed apply operations in this version.

Rolling back to version 0.29.0 appears to have resolved the issue for you, indicating a potential regression or change in behavior in version 0.30.0. It might be helpful to monitor the Atlantis repository for any updates or fixes related to this issue.

To continue talking to Dosu, mention @dosu.

@stasostrovskyi
Copy link
Contributor

stasostrovskyi commented Oct 29, 2024

If you are running Atlantis using github app, you may need to give it new permissions - Actions/Read. See https://www.runatlantis.io/docs/access-credentials.html#permissions

@cblkwell
Copy link
Author

I'll check (I don't have the permissions necessary to do so in our org). Was this something added in 0.30.0? If that's the case, it's probably a good thing to add breaking changes like this to the release notes. I don't see anything in https://github.com/runatlantis/atlantis/releases/tag/v0.30.0 describing this change.

@and-ku
Copy link

and-ku commented Oct 29, 2024

Im having similar issue but mine throws an error heres my Issue #5038

@ajax-ryzhyi-r
Copy link
Contributor

ajax-ryzhyi-r commented Oct 31, 2024

We're experiencing the same issue, despite our GitHub App having all the required permissions. When one of the required checks fails and is then successfully re-run for the same commit, Atlantis still doesn't allow us to apply the PR. Instead, it displays the error mentioned in this issue.

Here's what I found: We have a PR with one required check that passed after a few reruns. It's now in passed state:
Screenshot 2024-10-31 at 10 53 15

When we query check runs for the commit as Atlantis does, we see the following response from the GitHub GraphQL API:

 "nodes": [
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "FAILURE"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    }
]

GitHub returns all check runs, including failed ones, in chronological order. After investigating the Atlantis code to understand how it evaluates a commit's PR mergeability status, I found that the issue appears to be in this function: https://github.com/runatlantis/atlantis/blob/main/server/events/vcs/github_client.go#L718-L732

Atlantis iterates through the checkRuns here and determines the required check status based on the first run in the array ignoring the rest of the runs. In our case, it's FAILURE.

The fix seems straightforward. We simply need to evaluate the status of the last checkRun instead of the first one.
If the maintainers approve of this proposed solution, I'd be happy to submit a pull request with these changes.

@ajax-ryzhyi-r
Copy link
Contributor

We're experiencing the same issue, despite our GitHub App having all the required permissions. When one of the required checks fails and is then successfully re-run for the same commit, Atlantis still doesn't allow us to apply the PR. Instead, it displays the error mentioned in this issue.

Here's what I found: We have a PR with one required check that passed after a few reruns. It's now in passed state: Screenshot 2024-10-31 at 10 53 15

When we query check runs for the commit as Atlantis does, we see the following response from the GitHub GraphQL API:

 "nodes": [
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "FAILURE"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    }
]

GitHub returns all check runs, including failed ones, in chronological order. After investigating the Atlantis code to understand how it evaluates a commit's PR mergeability status, I found that the issue appears to be in this function: https://github.com/runatlantis/atlantis/blob/main/server/events/vcs/github_client.go#L718-L732

Atlantis iterates through the checkRuns here and determines the required check status based on the first run in the array ignoring the rest of the runs. In our case, it's FAILURE.

The fix seems straightforward. We simply need to evaluate the status of the last checkRun instead of the first one. If the maintainers approve of this proposed solution, I'd be happy to submit a pull request with these changes.

@jamengual @nitrocode What do you think?

@lukemassa
Copy link
Contributor

👋 @jamengual asked me to help coordinate this issue.

@ajax-ryzhyi-r I agree that's very likely where this logic is happening, and it also seems like that method was introduced in 0.30 so it fits with @cblkwell 's report that downgrading to 0.29 fixes the problem.

I haven't dug into the code yet to tell whether the specific solution you mention will fix the problem, but feel free to submit a PR and I can test and review!

@stasostrovskyi
Copy link
Contributor

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

@ajax-ryzhyi-r
Copy link
Contributor

ajax-ryzhyi-r commented Nov 1, 2024

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

@ajax-ryzhyi-r
Copy link
Contributor

👋 @jamengual asked me to help coordinate this issue.

@ajax-ryzhyi-r I agree that's very likely where this logic is happening, and it also seems like that method was introduced in 0.30 so it fits with @cblkwell 's report that downgrading to 0.29 fixes the problem.

I haven't dug into the code yet to tell whether the specific solution you mention will fix the problem, but feel free to submit a PR and I can test and review!

Thanks in advance for looking into this issue. I'll try to provide the fix mentioned in my previous comment yesterday 😊

@henriklundstrom
Copy link
Contributor

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

@ajax-ryzhyi-r I'm unable to reproduce the behaviour of the GraphQL API you describe. When I rerun a workflow on the same commit, only the latest run appears in the response. Could you please provide me the query you used?

@ajax-ryzhyi-r
Copy link
Contributor

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

@ajax-ryzhyi-r I'm unable to reproduce the behaviour of the GraphQL API you describe. When I rerun a workflow on the same commit, only the latest run appears in the response. Could you please provide me the query you used?

@henriklundstrom @stasostrovskyi Sorry for misleading you. Upon reviewing my query, I realized my initial understanding was incorrect. The issue occurs when the workflow is run multiple times for single commits, not when it is rerun for a single commit.

In our case, we have a workflow that validates PRs (checking title, checklists, labels, etc.) with the following triggers:

name: Validate pull request
on:
  pull_request:
    types: [edited, opened, reopened, synchronize]
    branches:
      - master

When this workflow fails, the PR author edits the PR to address the issues, and GitHub automatically triggers the workflow again, creating new run (For some reasonsI thought it reruns failed run ). After these steps, Atlantis is unable to apply the PR, reporting the error mentioned earlier.

However, the proposed fix still appears relevant. I'll only remove the part related to statusContexts, as the issue seems to be specifically about checkRuns (GitHub Actions workflow checks), and create a new PR with the fix for checkRuns only.

I'm not sure our issue is the same as originally mentioned here.

Anyway, here is query I used:

gh api graphql -f query='
query($owner: String!, $name: String!, $number: Int!, $ruleCursor: String, $contextCursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      reviewDecision
      baseRef {
        branchProtectionRule {
          requiredStatusChecks {
            context
          }
        }
        rules(first: 100, after: $ruleCursor) {
          pageInfo {
            endCursor
            hasNextPage
          }
          nodes {
            type
            repositoryRuleset {
              enforcement
            }
            parameters {
              ... on RequiredStatusChecksParameters {
                requiredStatusChecks {
                  context
                }
              }
              ... on WorkflowsParameters {
                workflows {
                  path
                }
              }
            }
          }
        }
      }
      commits(last: 1) {
        nodes {
          commit {
            statusCheckRollup {
              contexts(first: 100, after: $contextCursor) {
                pageInfo {
                  endCursor
                  hasNextPage
                }
                nodes {
                  __typename
                  ... on CheckRun {
                    name
                    status
                    conclusion
                    detailsUrl
                  }
                  ... on StatusContext {
                    context
                    state
                    targetUrl
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}' -F owner='<repo_owner>' -F name='<repo_name>' -F number='<pr_number>'

@henriklundstrom
Copy link
Contributor

@ajax-ryzhyi-r Okay I see, it happens not on rerun, but on re-trigger on the same commit. I was indeed able to reproduce it by closing and reopening a pull request without pushing any new commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Bug introduced in a new version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants