Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config validation for approver operate permissions can block all new config changes #11497

Open
sandro-h opened this issue Apr 19, 2023 · 6 comments

Comments

@sandro-h
Copy link
Contributor

sandro-h commented Apr 19, 2023

Issue Type
  • Bug Report
Summary

The config validation which checks that a stage approver has operate permissions on the pipeline group can cause a previously valid config to become invalid, which prevents any further changes of the config. When this happens, changes from any config repo (even without any approver lists) are rejected by GoCD, effectively blocking all new changes until the validation problem (which only affects a specific config repo) is resolved.

Basic environment details
  • Go Version: 23.1.0
  • JAVA Version: 17.0.6
  • OS: Linux 3.10.0-1160.81.1.el7.x86_64
Steps to Reproduce
  1. Set up two YAML config repos repoA and repoB:
    • repoA defines pipelineA in pipelineGroupA
    • repoB defines pipelineB in pipelineGroupB
  2. Create a GoCD role and give it operate permissions on pipelineGroupA and pipelineGroupB
  3. Create a GoCD user and give it the role
  4. Modify config repoA to restrict approval of the first stage to this user
stages:
  - build:
      approval:
        type: manual
        users: [my_user]
  1. Take the role away from the user
  2. Make a harmless change in config repoB
Expected Results

The change from step 7 is valid and merged in the config, since repoB doesn't even set any approval lists.

Actual Results

The change fails to merge and we get an error for repoB that is actually about repoA:

1. User "my_user" who is not authorized to operate pipeline group `pipelineGroupA` can not be authorized to approve stage;; 
Any other info

In my rudimentary understanding, the core problem is that a previously valid config can become invalid due to "outside" circumstances (removing the role from a user, which could also happen in e.g. LDAP). When this happens, any new change from other config repos will make GoCD perform a validation of the existing config, which fails due to these circumstances.

I'm not sure how this could be fixed - the config merge & validation logic looks quite complex. The simplest solution would be dropping that approver-has-operate-permission validation. It shouldn't cause security problems, since - well - the user doesn't have operate permissions anyway, but it makes it harder to find subtle misconfiguration problems, I suppose.

The current situation is regularly causing problems for our instance, as we started implementing stricter permission handling. Since we have hundreds of config repos managed by separate teams it's not as simple as quickly fixing the config to remove the faulty approver entry. It's made even worse when using and referencing "ephemeral" roles like LDAP or SAML roles, which users can intermittently lose until they log in again. This (among other reasons) forced us to switch to using only local GoCD roles and implementing our own role sync mechanism.

@chadlwilson
Copy link
Member

chadlwilson commented Apr 20, 2023

My gut feeling is that this should not be a fatal validation error for the reason you note. A previously valid config shouldn't be able to become invalid independently.

It'd be better if there was a way to communicate 'likely pipeline misconfiguration' warnings like this independent of failure but that's likely a much bigger challenge.

On the other hand I haven't thought through the consequences of this in any detail, so would need more analysis.

@GaneshSPatil
Copy link
Contributor

@chadlwilson @sandro-h

I believe the problem exists in the "Step 6" of removing the role from the user.

Ideally, GoCD should not allow removal of the role from the user, as this role removal will lead to the mentioned error.

To fix this particular problem, I believe, we can add a specific validation while roles are being updated to make sure that config is valid.
However, as you mentioned, with external auth mechanism (LDAP), it may get tricker and may need further understanding.

@chadlwilson
Copy link
Member

chadlwilson commented Apr 23, 2023

Thanks @GaneshSPatil With an external rbac system you kind of have to let that happen or assume it will happen (either due to some kind of 'use it or lose it' policy or perhaps even an employee/user leaving)

My perspective is that with these types of things I think one probably has to ensure it does the right kind of thing at runtime (if the user logs in they can't actually do anything which likely already works so it 'fails safe') but generally expect those with permissions to commit to the config repository or overall admins to correct the problem when detected rather than do this kind of 'complete' validation.

@mattgauntseo-sentry
Copy link

I've started to run into this issue as well.

Surprisingly, the config repo containing the pipeline with the problematic role isn't the repo GoCD is erroring on, instead it's failing on 2 other config repos, causing both to update.

Fixing the role issue doesn't seem to have an immediate affect either, so I'm in a limbo state.

@sandro-h
Copy link
Contributor Author

I looked into this a bit more:

  • Turning this validation into a warning would indeed be a lot of effort since the validation code only supports errors and changing that requires touching large swaths of code.
  • I double-checked that even without the validation, there is still a runtime check for the authorization. I.e. if someone is in the approver list, but doesn't have operate permission on the pipeline, they cannot start the pipeline/stage. I would've expected this anyway, since the validation doesn't "fix" the invalid state, so there needs to be runtime check.

So I have two simpler proposals:

  1. Remove the validation completely. I personally don't see a lot of value (especially opposite the downsides). Listing users who can't operate the pipeline in the approval list doesn't break anything, it just doesn't have any effect. And with or without approval list, if a user who only has view permission tries to start it, they get the same error.

  2. If not outright remove it, give us the option to disable it via config flag.
    I prototyped it with a validateApproverPermissions attribute on the security element:

    <security validateApproverPermissions="false">

    Approval::validateOperatePermissions

    SecurityConfig serverSecurityConfig = validationContext.getServerSecurityConfig();
    if (!serverSecurityConfig.isValidateApproverPermissions()) {
      return;
    }

@chadlwilson
Copy link
Member

My inclination is towards option 1 rather than adding complexity, but I also haven't thought through in much detail what this might mean and a bit worried about unintended consequences and making it harder to configure things correctly.

Mitigation would be to ensure the error messages are good when there is a missing permission, and/or ensuring something on the UI or other config repo plugin docs highlight the need for the operate permission when configuring stage approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants