Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workspace packages #1892

Merged
merged 116 commits into from
Jul 23, 2024
Merged

Add workspace packages #1892

merged 116 commits into from
Jul 23, 2024

Conversation

JimMadge
Copy link
Member

@JimMadge JimMadge commented May 16, 2024

✅ Checklist

  • You have given your pull request a meaningful title (e.g. Enable foobar integration rather than 515 foobar).
  • You are targeting the appropriate branch. If you're not certain which one this is, it should be develop.
  • Your branch is up-to-date with the target branch (it probably was when you started, but it may have changed since then).

🚦 Depends on

⤴️ Summary

  • Add ansible playbook for desired state
    • Install apt packages
    • Configures shell/prompts
  • Remove FileUpload dynamic provider

I have moved some, but not all desired state here.
There were some files templated in cloud-init which need data from Pulumi/configuration.
We can do that in Ansible, but we should think about how to we get the data to the workspaces.

  • Template files locally and send them
  • Pass variable to Ansible process?

There are some places where we could take advantage of the "official" Ansible community modules, or other packages from Ansible Galaxy. That will also require a bit of thought though as ansible-galaxy is really a package manager, so would be difficult to use in a locked down SRE.

Further Work

  • Using Ansible galaxy collections/roles
  • Testing (molecule?)
  • Auditd configuration
  • Clamav configuration
  • Packages which require some manual work
    • RStudio
    • Nvidia drivers
    • Azure Data Studio and sqlcmd (possibly)
  • Developer docs (how to change configuration and run using az CLI)
  • Tightening networking rules
  • Fix release lookup (and remove old releases)

🌂 Related issues

Closes #1574

🔬 Tests

Tested in a new deployment,

  • Desired state container mounted correctly
  • Playbook runs
  • Systemd module installed
  • Systemd module works
  • Systemd timer installed and active
  • Manually checked modified files, systemd services, etc.
  • Manually checked requested packages were installed

Copy link

github-actions bot commented May 16, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  data_safe_haven/infrastructure/common
  ip_ranges.py
  data_safe_haven/infrastructure/components/composite
  __init__.py
  virtual_machine.py
  data_safe_haven/infrastructure/components/dynamic
  __init__.py
  entra_application.py 115
  data_safe_haven/infrastructure/programs/sre
  data.py 82, 468-592, 775, 923
  networking.py 402, 1464, 1852
  workspaces.py 67, 126
  data_safe_haven/types
  enums.py
Project Total  

This report was generated by python-coverage-comment-action

@JimMadge JimMadge mentioned this pull request May 17, 2024
5 tasks
@JimMadge
Copy link
Member Author

Looking at some options to get the files on workspace VMs. The constraints are,

  • Needs to happen early after deployment
    • Will have Linux utils and some basics from cloud-init (curl, bash, sed, but not az)
  • Should happen on all workspaces
    • Easier to avoid authentication if possible
  • Data is non-sensitive
    • No need for authentication for read access

The good solutions I can see are,

  1. Blob container with anonymous read access
  • Use the existing data_configuration storage account (is that appropriate @jemrobinson)
  • Could create a new storage account if we are worried about enabling anonymous access
  • Anonymous read access
  • Networking rules only allow access from inside SRE
  • Upload a manifest file (as you need list permissions to list all files)
  • Script to fetch manifest, then fetch all files in the manifest using curl
  1. Blob container with sftp access
  • Storage account needs hierarchical namespace
  • Enable SFTP
  • Create local user with read/list permissions
    • With machine password for all workspaces or,
    • with ssh keypairs generated on each workspace
  • Script to scp from sftp using local account/password/pubkey

Non-options,

  1. NFS: No easy way for us/admins to put files into NFS share
  2. BlobFuse or SMB: more pre-requisites

Using anonymous access blob and http is nice for not needing authentication. However, you have to keep the manifest up to date which could be prone to error.
SFTP might be the best option, for the cost of storing the SFTP password and writing it to each workspace VM.

Thoughts @jemrobinson @craddm

@jemrobinson
Copy link
Member

jemrobinson commented May 21, 2024

Is the cloud-init Ansible support (https://cloudinit.readthedocs.io/en/latest/reference/modules.html#ansible) any help here?

Some thoughts on your suggestions:

  1. data_configuration storage account
  • Access is allowed from known data_configuration_ip_addresses
  • Currently doesn't have any blob containers (only file shares)
  • See e.g. the way we mount ingress and egress from blob storage over NFSv3
  1. Enabling SFTP
  • This is a bit of a mess, but @craddm got it working for a DSG in the past (hopefully notes on how to do it are somewhere)
  • Not sure I would recommend it

How/when would the file get updated? Is it reasonable to imagine e.g. an Azure Function that would do the updating? If so, perhaps it could write directly to the VM and you wouldn't need a storage volume at all?

@craddm
Copy link
Contributor

craddm commented May 21, 2024

I do have some notes on SFTP, yes. Wasn't enormously tricky, IIRC. Creating a local user was easy through the portal, and seems from Powershell it was easy to create one with a password rather than an SSH key.

I'm not seeing much about Azure Functions writing directly to VMs; you can trigger them on uploads to blob storage but so far I've only seen people suggest that the function copies files to a File Share

@JimMadge
Copy link
Member Author

@jemrobinson How much experience with key pairs in Azure/Pulumi do you have?

Looks like your can't generate a local user with password using Pulumi (at least, the password isn't an output).
You can specify an existing key pair though. That would mean we would need to be able to fetch the private key for each workspace.

@jemrobinson
Copy link
Member

What Pulumi resource are you using to generate a local user? I assume you don't mean in cloud-init?

@JimMadge
Copy link
Member Author

azure_native.storage.LocalUser

Local user for SFTP, not for the workspaces.

@JimMadge
Copy link
Member Author

@jemrobinson Looks like Azure expects to keep the private key for keys in a keyvault.

I think the best option would be to use the TLS package to generate a key, then store that as a secret.
(I'm now vaguely recalling using the TLS package before to do something like this).

@jemrobinson
Copy link
Member

I'm still a bit confused about what advantage we get for using SFTP over one of the other options. Is it just that we can do so earlier in the cloud-init order of operations? If so, why does this matter?

N.B. if you want to generate a Key in a Keyvault, I think you should probably use the Key resource

@JimMadge
Copy link
Member Author

The SFTP way means,

  • No requirements beyond what is already on a bare Ubuntu VM image
  • No need to have a manifest (you can just scp -r user@host:* /)

The current downside is it means sharing a private key between workspaces (or creating a new local account with key pair for each workspace).

@jemrobinson
Copy link
Member

jemrobinson commented Jul 21, 2024

  1. Cannot log in via GUI with user credentials (grey error box that usually indicates incorrect username/password)
Screenshot 2024-07-21 at 13 19 29
  1. Cannot log in via serial console with admin credentials (not recognised)
Screenshot 2024-07-21 at 13 19 57

@jemrobinson
Copy link
Member

Also blob creation in the desired state account is failing on a fresh deployment:

  azure-native:storage:Blob (sre_data_blob_desired_state_blob_run_all_tests.bats):
    error: retrieving blob properties "files/home/dshadmin/tests/run_all_tests.bats" (container "desiredstate" / account "shgresrebotdesiredstatec"): blobs.Client#GetProperties: Failure responding to request: StatusCode=403 -- Original Error:
autorest/azure: error response cannot be parsed: {"" '\x00' '\x00'} error: EOF

@JimMadge
Copy link
Member Author

I did see that serial console problem. I did a fresh deployment and it didn't occur again. I assumed it was some problem unrelated to the changes here.

Haven't seen the azure-native Blob error. Looks like an authentication problem.

My output above was from a fresh sre deployment at 226bebd.

@jemrobinson
Copy link
Member

I think the blob issue is actually a Storage Account issue, where sometimes the network permissions on the storage account ("IP address A.B.C.D can access this account") don't translate into actually being able to access the account from that IP address. Worth moving this into its own issue as I don't think it's related to this PR.

@JimMadge
Copy link
Member Author

New deployment at 46af571, Pulumi ran without error, Ansible is now running on the workspace VM.

@JimMadge
Copy link
Member Author

Replicated serial console and guacamole login problems.

#2028 is possibly causing guacamole login problems (redeployed Entra ID App doesn't have the correct token to authenticate with LDAP).

# ldapsearch -H "ldap://identity.oda.daimyo.develop.turingsafehaven.ac.uk:1389"
ldap_sasl_interactive_bind: Protocol error (2)
        additional info: LDAP search request failed. Failed to fetch bearer token from OAuth endpoint.
(unauthorized_client) AADSTS700016: Application with identifier 'd13c364d-992e-4ff2-8911-40c6646c82fd' was not found in the directory 'Turing Data Safe Haven (Green)'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant. Trace ID: 02613c3e-331e-4e2c-b034-1b317be65200 Correlation ID: 302e5952-a067-4456-95d8-4500577d0602 Timestamp: 2024-07-22 10:49:52Z

@JimMadge
Copy link
Member Author

For serial console issue, it may be a sporadic problem.

I will try,

  • Adding workspaces
  • Running deployment without cloudinit and Ansible

I would be surprised if it were cloudinit or Ansible (although I can't think what it is) as they shouldn't modify that user and using the portal password reset (either to change dshadmin's password, or create a new user) doesn't work.

@JimMadge
Copy link
Member Author

JimMadge commented Jul 22, 2024

Some progress,

I can log in to the serial console using the dshadmin account before cloudinit has finished running. That sessions works as long as it is active. If you log out after cloudinit has finished, wait until it has finished before logging in, restart the VM, you get the login incorrect error.

That would suggest there is something in cloud init/ansible that breaks the account.
(Still strange that resetting doesn't appear to work).

@JimMadge
Copy link
Member Author

In a VM where I had a session,

  • # su dshadmin works (mimic user without password)
  • login dshadmin does not work, login incorrect error
  • # passwd dshadmin lets you change the password, login still fails.

@JimMadge
Copy link
Member Author

Suspect it may be a PAM configuration issue.

@JimMadge
Copy link
Member Author

Before and after correcting /etc/pam.d/common-session,

Jul 22 14:51:57 shm-daimyo-sre-oda-vm-workspace-01 login[100118]: PAM (other) illegal module type: pam_mkhomedir.so
Jul 22 14:51:57 shm-daimyo-sre-oda-vm-workspace-01 login[100118]: PAM pam_parse: expecting return value; [...skel=/etc/skel]
Jul 22 14:52:01 shm-daimyo-sre-oda-vm-workspace-01 login[100118]: FAILED LOGIN (1) on '/dev/pts/0' FOR 'dshadmin', Permission denied
Jul 22 14:52:09 shm-daimyo-sre-oda-vm-workspace-01 login[100118]: FAILED LOGIN (2) on '/dev/pts/0' FOR 'dshadmin', Permission denied
Jul 22 14:56:49 shm-daimyo-sre-oda-vm-workspace-01 login[101098]: pam_unix(login:session): session opened for user dshadmin(uid=1000) by dshadmin(uid=0)
Jul 22 14:56:54 shm-daimyo-sre-oda-vm-workspace-01 login[101098]: pam_unix(login:session): session closed for user dshadmin
Jul 22 15:02:17 shm-daimyo-sre-oda-vm-workspace-01 login[102510]: pam_unix(login:session): session opened for user dshadmin(uid=1000) by dshadmin(uid=0)
Jul 22 15:02:22 shm-daimyo-sre-oda-vm-workspace-01 login[102510]: pam_unix(login:session): session closed for user dshadmin

@jemrobinson
Copy link
Member

jemrobinson commented Jul 22, 2024

For reference, here's /etc/pam.d/common-session from an SRE deployed from develop

#
# /etc/pam.d/common-session - session-related modules common to all services
#
# This file is included from other service-specific PAM config files,
# and should contain a list of modules that define tasks to be performed
# at the start and end of interactive sessions.
#
# As of pam 1.0.1-6, this file is managed by pam-auth-update by default.
# To take advantage of this, it is recommended that you configure any
# local modules either before or after the default block, and use
# pam-auth-update to manage selection of other modules.  See
# pam-auth-update(8) for details.

# here are the per-package modules (the "Primary" block)
session [default=1]                     pam_permit.so
# here's the fallback if no module succeeds
session requisite                       pam_deny.so
# prime the stack with a positive return value if there isn't one already;
# this avoids us returning an error just because nothing sets a success code
# since the modules above will each just jump around
session required                        pam_permit.so
# The pam_umask module will set the umask according to the system default in
# /etc/login.defs and user settings, solving the problem of different
# umask settings with different shells, display managers, remote sessions etc.
# See "man pam_umask".
session optional                        pam_umask.so
# and here are more per-package modules (the "Additional" block)
session required                        pam_unix.so 
session [success=ok default=ignore]     pam_ldap.so minimum_uid=1000
session optional                        pam_systemd.so 
session optional                        pam_mkhomedir.so skel=/etc/skel umask=0022
# end of pam-auth-update config

Can you compare to the one deployed from this branch?

@JimMadge
Copy link
Member Author

JimMadge commented Jul 22, 2024

@jemrobinson The same (now) 👍
Before there was a line starting with pam_mkhomedir.so.

@jemrobinson
Copy link
Member

I can log into the serial console with these changes and #2034 fixes the Entra application issue. Looking into a problem with the apricot identity server.

@jemrobinson
Copy link
Member

jemrobinson commented Jul 23, 2024

@JimMadge : Interactive login is now working (with the fix from #2034 applied) but I'm getting a prompt that needs sudo approval. Can we get ansible to do this or alternatively, remove it if it's not needed?

Screenshot 2024-07-23 at 08 29 55

There's a potential solution here: https://www.reddit.com/r/Ubuntu/comments/15stmwn/how_do_i_suppress_authentication_is_required_to/ but I haven't tested it.

Captured in #2035

@JimMadge
Copy link
Member Author

@jemrobinson Excellent, thanks for testing.

I'm sure we had that error before and fixed in in v3-4. We should have a look in the old cloud-init, issues/PRs.
I assume you can close the prompt and ignore it safely?

Can we,

  1. Merge this and Stop unnecessary resource recreation #2034
  2. Open an issue for the prompt?

@jemrobinson
Copy link
Member

jemrobinson commented Jul 23, 2024

Screenshot 2024-07-23 at 08 50 47

looks like we don't have python-is-python3 installed.

@JimMadge
Copy link
Member Author

I've moved python-is-python3 to the common list in dfda85d

Copy link
Member

@jemrobinson jemrobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with this. There are some remaining problems that have been pulled out into their own issues.

@JimMadge JimMadge merged commit bc6664f into develop Jul 23, 2024
11 checks passed
@JimMadge JimMadge deleted the workspace_software branch July 23, 2024 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Minimal required workspace software
3 participants