Release: v5.3.0 #2363

Closed
16 of 18 tasks
craddm opened this issue Jan 15, 2025 · 3 comments
@craddm
Contributor

craddm commented Jan 15, 2025

✅ Checklist

Refer to the Deployment section of our documentation when completing these steps.

  • Consult the data-safe-haven/VERSIONING.md guide and determine the version number of the new release. Record it in the title of this issue
  • Create a release branch called e.g. release-v0.0.1
    • If this is a hotfix release then this branch should be based off latest
    • In all other cases it should be based off develop
  • Draft a changelog for the release similar to our previous releases
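The branch naming and base-branch rules above can be sketched as a small helper. This is only an illustration, not part of the project's tooling; the function names `release_branch_name` and `release_branch_base` are hypothetical.

```python
def release_branch_name(version: str) -> str:
    """Return the release branch name for a version such as '0.0.1' or 'v0.0.1'."""
    # Accept the version with or without a leading 'v' (requires Python 3.9+).
    return f"release-v{version.removeprefix('v')}"


def release_branch_base(is_hotfix: bool) -> str:
    """Hotfix releases branch from 'latest'; all other releases branch from 'develop'."""
    return "latest" if is_hotfix else "develop"


if __name__ == "__main__":
    print(release_branch_name("0.0.1"), "based off", release_branch_base(is_hotfix=False))
```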

For patch releases only

  • Confirm that the resource to which the patch applies can be successfully deployed

For minor releases and above

  • Deploy an SHM from this branch and save a transcript of the deployment logs
  • Deploy a tier 2 SRE from this branch and save the transcript of the deployment logs
  • Deploy a tier 3 SRE from this branch and save the transcript of the deployment logs
  • Complete the Security evaluation checklist from the deployment documentation

For major releases only

  • Confirm that a third party has carried out a full penetration test evaluating:
    1. external attack surface
    2. ability to exfiltrate data from the system
    3. ability to transfer data between SREs
    4. ability to escalate privileges on the SRD.
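As a rough sketch of how the three tiers of checks above stack up, the snippet below maps a release type to the checklist items that apply. It assumes conventional major.minor.patch numbering; VERSIONING.md remains the authority on how the version number is chosen. The names `ReleaseType`, `release_type` and `required_checks`, and the version numbers in the usage note, are illustrative only.

```python
from enum import IntEnum


class ReleaseType(IntEnum):
    PATCH = 1
    MINOR = 2
    MAJOR = 3


def release_type(previous: str, new: str) -> ReleaseType:
    """Classify a release by the first version component that changed."""
    old_parts = [int(p) for p in previous.removeprefix("v").split(".")]
    new_parts = [int(p) for p in new.removeprefix("v").split(".")]
    if new_parts[0] != old_parts[0]:
        return ReleaseType.MAJOR
    if new_parts[1] != old_parts[1]:
        return ReleaseType.MINOR
    return ReleaseType.PATCH


def required_checks(kind: ReleaseType) -> list[str]:
    """List the checklist items that apply to a release of the given type."""
    if kind is ReleaseType.PATCH:
        return ["Confirm the patched resource deploys successfully"]
    # Minor releases and above share the deployment and security checks.
    checks = [
        "Deploy an SHM and save the deployment logs",
        "Deploy a tier 2 SRE and save the deployment logs",
        "Deploy a tier 3 SRE and save the deployment logs",
        "Complete the security evaluation checklist",
    ]
    if kind is ReleaseType.MAJOR:
        checks.append("Confirm a third-party penetration test has been carried out")
    return checks
```

For example, `required_checks(release_type("v1.4.2", "v1.5.0"))` returns the four minor-release items.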

Update documentation

  • Update supported versions in SECURITY.md
  • Update pen test results in VERSIONING.md

Making the release

  • Merge release branch into latest
  • Create a tag of the form v0.0.1 pointing to the most recent commit on latest (the merge that you just made)
  • Publish your draft GitHub release using this tag
  • Ensure docs for the latest version are built and deployed on ReadTheDocs
  • Push a build to PyPI
  • Announce release on communications channels
  • Create a PR from latest into develop to ensure that release-specific changes are not lost
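Before tagging, it can help to check that the tag really has the `v0.0.1` form described above. The following is a minimal sketch using only the standard library; `parse_release_tag` is a hypothetical helper, not something the project provides.

```python
import re

# Matches tags of the form v<major>.<minor>.<patch>, e.g. v0.0.1
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag: str) -> tuple[int, int, int]:
    """Return (major, minor, patch) for a well-formed tag, else raise ValueError."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"Tag {tag!r} is not of the form v0.0.1")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch
```

A tag without the leading `v`, or with extra suffixes, is rejected rather than silently accepted.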

🌳 Deployment problems

Initial deployment of Tier 2 SRE

  • SSL certificate creation failed on the first attempt but succeeded on the second
  • Firewall failed to deploy
  • The second attempt also failed, with a new error:
Pulumi error:  +  azure-native:network:Route sre_firewall_route_via_firewall creating (9s) error: Code="MissingNextHopIpAddress" Message="NextHopIpAddress cannot be Null or Empty in route ViaFirewall when NextHopType is VirtualAppliance."
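The rule stated in the error message can be expressed as a simple pre-flight check. This is a sketch only: `validate_route` is a hypothetical helper and does not call Pulumi or Azure. One plausible reading, not confirmed in this thread, is that the firewall's private IP address was unavailable because the firewall itself had failed to deploy.

```python
from typing import Optional


def validate_route(name: str, next_hop_type: str, next_hop_ip_address: Optional[str]) -> None:
    """Raise ValueError if a VirtualAppliance route has no next hop IP address."""
    # Mirrors the rule in the Azure error message: NextHopIpAddress must not be
    # null or empty when NextHopType is VirtualAppliance.
    if next_hop_type == "VirtualAppliance" and not next_hop_ip_address:
        raise ValueError(
            f"Route {name!r}: next_hop_ip_address is required "
            "when next_hop_type is 'VirtualAppliance'"
        )
```

Calling `validate_route("ViaFirewall", "VirtualAppliance", None)` raises, which would surface the problem before the route is submitted.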

Initial deployment of Tier 3 SRE failed

Diagnostics:
  azure-native:network:AzureFirewall (sre_firewall_firewall):
    error: 1 error occurred:
        * GET https://management.azure.com/subscriptions/3f1a8e26-eae2-4539-952a-0a6184ec248a/providers/Microsoft.Network/locations/uksouth/operations/c9c47b66-593c-430d-819b-1433ef1d12ab
    --------------------------------------------------------------------------------
    RESPONSE 200: 200 OK
    ERROR CODE: GatewayAllocationFailed
    --------------------------------------------------------------------------------
    {
      "status": "Failed",
      "error": {
        "code": "GatewayAllocationFailed",
        "message": "Compute allocation failed. Please retry later.",
        "details": []
      }
    }
    --------------------------------------------------------------------------------

  azure-native:network:PrivateEndpoint (sre_data_storage_account_data_configuration_private_endpoint):
    error: 1 error occurred:
        * GET https://management.azure.com/subscriptions/3f1a8e26-eae2-4539-952a-0a6184ec248a/providers/Microsoft.Network/locations/uksouth/operations/d0b31274-dd1f-4557-bd42-47bee6642b72
    --------------------------------------------------------------------------------
    RESPONSE 200: 200 OK
    ERROR CODE: RetryableError
    --------------------------------------------------------------------------------
    {
      "status": "Failed",
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/3f1a8e26-eae2-4539-952a-0a6184ec248a/resourceGroups/shm-muppets-sre-rizzo-rg/providers/Microsoft.Network/virtualNetworks/shm-muppets-sre-rizzo-vnet/subnets/DataConfigurationSubnet used by resource /subscriptions/3f1a8e26-eae2-4539-952a-0a6184ec248a/resourceGroups/shm-muppets-sre-rizzo-rg/providers/Microsoft.Network/networkInterfaces/shm-muppets-sre-rizzo-pep-storage-account-d.nic.59b9ffb7-8570-4b22-8836-f80344b66c35 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
          }
        ]
      }
    }
    --------------------------------------------------------------------------------
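Both failures above carry codes whose messages invite a retry ("Please retry later", "A retryable error occurred"). A minimal sketch of classifying such response bodies and retrying with exponential backoff follows. The helper names, the attempt count and the delays are assumptions for illustration; this is not how the deployment tooling handles these errors.

```python
import time
from typing import Callable

# Error codes seen in the diagnostics above whose messages suggest retrying.
RETRYABLE_CODES = {"GatewayAllocationFailed", "RetryableError"}


def is_retryable(operation: dict) -> bool:
    """Return True if a failed operation body carries one of the transient codes."""
    error = operation.get("error") or {}
    return operation.get("status") == "Failed" and error.get("code") in RETRYABLE_CODES


def retry_with_backoff(
    action: Callable[[], dict],
    attempts: int = 4,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call action until its result is not retryable or attempts run out."""
    result: dict = {}
    for attempt in range(attempts):
        result = action()
        if not is_retryable(result) or attempt == attempts - 1:
            break
        # Wait 30s, 60s, 120s, ... between attempts (with the default base_delay).
        sleep(base_delay * 2**attempt)
    return result
```

Passing a custom `sleep` makes the helper easy to exercise without actually waiting.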
  • These failures may have been a transient Azure problem, as the deployment now succeeds
  • A user in the correct group is unable to log in to the workspace
  • A DNS issue may be the cause. When a login attempt is made, the DNS logs show the following:
2025/01/16 16:47:21.538364 42#29868 [debug] filtering: found rule "*.*" for host "record.bauhqxgcxjnudchmhtgwk1hvig.zx.internal.cloudapp.net", filter list id: 0
2025/01/16 16:47:21.538380 42#29868 [debug] dnsforward: host "record.bauhqxgcxjnudchmhtgwk1hvig.zx.internal.cloudapp.net" is filtered, reason: "FilteredBlackList"
  • These problems were not replicated on fresh deployments of tier 2 and tier 3 SREs
  • The tier 3 SRE suffered a series of problems during deployment. Diagnostic settings for a variety of resources (e.g. storage accounts and the firewall) were repeatedly reported as already existing and had to be deleted before deployment could finish. After deployment, logging in showed no connections for a registered user. On inspection, this was because the guacamole-user-sync server could not contact the LDAP server (apricot). The DNS entry for apricot was incorrect:

[Image: screenshot of the incorrect DNS entry for apricot]

Manually correcting the DNS record allowed guacamole-user-sync to contact the LDAP server successfully, and it appeared to sync. However, no connections appeared for the registered user.

Tore down and redeployed the tier 3 SRE, after which everything was functional.

@craddm craddm added the release candidate This is a candidate for release label Jan 15, 2025
@craddm craddm changed the title Release: v5.2.1 Release: v5.2.2 Jan 15, 2025
@craddm craddm changed the title Release: v5.2.2 Release: v5.3.0 Jan 15, 2025
@JimMadge JimMadge mentioned this issue Jan 15, 2025
3 tasks
@github-project-automation github-project-automation bot moved this to To Be Refined in Data Safe Haven Jan 15, 2025
@JimMadge JimMadge moved this from To Be Refined to In progress in Data Safe Haven Jan 15, 2025
@craddm craddm self-assigned this Jan 15, 2025
@JimMadge
Member

I've been unable to reproduce the deployment problem above. Possibly an intermittent problem.

@JimMadge
Member

2025/01/16 16:47:21.538364 42#29868 [debug] filtering: found rule "*.*" for host "record.bauhqxgcxjnudchmhtgwk1hvig.zx.internal.cloudapp.net", filter list id: 0

I would be surprised if this was a problem. There has always been a filter rule for *.* and the permitted domains shouldn't have changed, so I think this should be the same behaviour as before.

@jemrobinson
Member

We should find out whether we need to allowlist *.internal.cloudapp.net - it's used for reverse DNS for VMs
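To illustrate the allowlisting question, here is a small model of a default-deny DNS filter: a catch-all `*.*` rule blocks every host unless an allowlist pattern matches first. It is a sketch under that assumption and uses shell-style wildcards from the Python standard library; the real DNS server's rule syntax and precedence may differ, and `is_permitted` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def is_permitted(host: str, allowlist: list[str]) -> bool:
    """Return True if host matches an allowlist pattern, else it falls to the '*.*' block rule."""
    # DNS names are case-insensitive and may carry a trailing dot.
    normalised = host.rstrip(".").lower()
    return any(fnmatchcase(normalised, pattern.lower()) for pattern in allowlist)


if __name__ == "__main__":
    host = "record.bauhqxgcxjnudchmhtgwk1hvig.zx.internal.cloudapp.net"
    print(is_permitted(host, []))
    print(is_permitted(host, ["*.internal.cloudapp.net"]))
```

With an empty allowlist the host from the log is blocked; adding `*.internal.cloudapp.net` lets it through in this model.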

@craddm craddm closed this as completed Jan 21, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in Data Safe Haven Jan 21, 2025