@bertiethorpe bertiethorpe commented Nov 4, 2025

The ansible/site.yml playbook now locks instances to prevent inadvertent changes to, or deletion of, cluster instances. Instances are automatically unlocked by the ansible/adhoc/rebuild-via-slurm.yml playbook. Instances must be manually unlocked using the ansible/adhoc/unlock.yml playbook before running tofu apply, tofu destroy or the development ansible/adhoc/rebuild.yml playbook.
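As a rough illustration of the ordering described above, here is a minimal Python sketch that only prints the commands for an infrastructure change on locked instances. The playbook paths come from this PR; the helper name `change_sequence` is hypothetical, and the sketch assumes the environment is already activated and that each command is run from the appropriate directory (which it does not attempt to handle).

```python
import shlex

# Playbook paths as named in this PR description.
UNLOCK_PLAYBOOK = "ansible/adhoc/unlock.yml"
SITE_PLAYBOOK = "ansible/site.yml"


def change_sequence(tofu_action: str = "apply") -> list[list[str]]:
    """Return, in order, the commands for a tofu change on locked instances.

    Hypothetical helper for illustration only; it does not run anything.
    """
    if tofu_action not in ("apply", "destroy"):
        raise ValueError(f"unsupported tofu action: {tofu_action!r}")
    steps = [
        ["ansible-playbook", UNLOCK_PLAYBOOK],  # manual unlock comes first
        ["tofu", tofu_action],
    ]
    if tofu_action == "apply":
        # Running site.yml afterwards locks the instances again.
        steps.append(["ansible-playbook", SITE_PLAYBOOK])
    return steps


for cmd in change_sequence("apply"):
    print(shlex.join(cmd))
```

After a destroy there are no instances left to re-lock, so the sketch omits the final site.yml step in that case.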


An image build does not need to be committed in this PR, but one should be carried out to test the revised code.

This only protects instances. Volumes for production environments should not be tofu-controlled as per the production docs.

Some alternative approaches were considered and rejected:

  • Unlocking and relocking compute nodes from the RebootProgram: This is not feasible as the rebuild call returns immediately, and we do not want to block the script checking for rebuild completion.
  • Preventing OpenTofu operations (apply, destroy etc) in a protected environment without some manual acceptance (in addition to the confirmation OpenTofu always prompts for). This can be implemented e.g. using variable conditions, but has to be done in the "root" module (i.e. environments/*/tofu/), not in the cluster module, as there is no way to directly set variables for the latter. The best approach at present would seem to be to define an extra cookiecutter file for tofu:
variable "allow_tofu" {
  type = bool
  description = "Set to true to acknowledge the current environment is protected (e.g. production)"
  default = false
  validation {
    condition = anytrue(
      [
        ! contains(var.protected_environments, basename(var.environment_root)),
        var.allow_tofu
      ]
    )
    error_message = "Running tofu in a protected environment requires var.allow_tofu = true"
  }
}

variable "protected_environments" {
  type = list(string)
  default = ["production"]
}

In a protected environment, this results in an error message like the following:

tofu apply
...
Planning failed. OpenTofu encountered an error while generating this plan.

╷
│ Error: Invalid value for variable
│ 
│   on protected_environments.tf line 1:
│    1: variable "allow_tofu" {
│     ├────────────────
│     │ var.allow_tofu is false
│     │ var.environment_root is "/home/steveb/Documents/smslabs/ansible-slurm-appliance/environments/.stackhpc"
│     │ var.protected_environments is list of string with 1 element
│ 
│ Running tofu in a protected environment requires var.allow_tofu = true
│ 
│ This was checked by the validation rule at protected_environments.tf:5,3-13.

Actually running a plan, apply or destroy then requires passing the variable explicitly, e.g. tofu apply -var=allow_tofu=true.

This would only apply to new environments, and would have to be manually copied into existing environments to get the benefit. Given this would not (on its own) protect against e.g. CLI/Horizon changes, it does not currently seem worth implementing.
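For anyone reading the validation block above, the condition can be traced in plain Python. This is only a sketch of the logic, not part of the PR: the function name `tofu_allowed` and the example paths are hypothetical, and `PurePosixPath(...).name` stands in for the HCL `basename()` function.

```python
from pathlib import PurePosixPath


def tofu_allowed(environment_root: str,
                 protected_environments: list[str],
                 allow_tofu: bool = False) -> bool:
    """Mirror of: anytrue([!contains(protected, basename(root)), allow_tofu])."""
    env_name = PurePosixPath(environment_root).name  # last path component
    return any([env_name not in protected_environments, allow_tofu])


# Hypothetical environment paths, for illustration only.
print(tofu_allowed("/repo/environments/dev", ["production"]))               # True
print(tofu_allowed("/repo/environments/production", ["production"]))        # False
print(tofu_allowed("/repo/environments/production", ["production"], True))  # True
```

The check passes either when the environment is not in the protected list or when the operator has explicitly acknowledged it, which is why an unprotected environment never needs the extra variable.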

sjpb commented Nov 7, 2025

Should probably do a fatimage build for this to check it doesn't somehow break that. We don't need to bump the image though.

@sjpb sjpb left a comment

Initial review

@bertiethorpe bertiethorpe force-pushed the feat/prevent-prod-changes branch from e355f49 to 6119c22 Compare November 12, 2025 18:49
@bertiethorpe bertiethorpe marked this pull request as ready for review November 13, 2025 09:46
@bertiethorpe bertiethorpe requested a review from a team as a code owner November 13, 2025 09:46
@bertiethorpe bertiethorpe requested a review from sjpb November 13, 2025 09:46
@bertiethorpe bertiethorpe requested a review from sjpb November 13, 2025 14:02
@sjpb sjpb force-pushed the feat/prevent-prod-changes branch 2 times, most recently from 7947368 to b0afab5 Compare January 28, 2026 12:12
@sjpb sjpb force-pushed the feat/prevent-prod-changes branch from b0afab5 to 80d3b4c Compare January 28, 2026 12:15
@sjpb sjpb changed the title from "Add safety checks to site production environments & Lock instances after site.yml" to "Lock instances to prevent accidental changes" Jan 28, 2026
sjpb commented Jan 28, 2026
