Path clash with Azure local storage #2221

Closed
5 tasks done
craddm opened this issue Oct 3, 2024 · 27 comments · Fixed by #2223

@craddm
Contributor

craddm commented Oct 3, 2024

✅ Checklist

  • I have searched open and closed issues for duplicates.
  • This is a problem observed when deploying a Data Safe Haven.
  • I can reproduce this with the latest version.
  • I have read through the documentation.
  • This isn't an open-ended question (open a discussion if it is).

💻 System information

  • Operating System: Debian bookworm
  • Data Safe Haven version: develop

📦 Packages

List of packages
acme==2.10.0
annotated-types==0.7.0
appdirs==1.4.4
Arpeggio==2.0.2
attrs==24.2.0
azure-common==1.1.28
azure-core==1.31.0
azure-identity==1.18.0
azure-keyvault-certificates==4.8.0
azure-keyvault-keys==4.9.0
azure-keyvault-secrets==4.8.0
azure-mgmt-compute==33.0.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-dns==8.1.0
azure-mgmt-keyvault==10.3.1
azure-mgmt-msi==7.0.0
azure-mgmt-rdbms==10.1.0
azure-mgmt-resource==23.1.1
azure-mgmt-storage==21.2.1
azure-storage-blob==12.23.1
azure-storage-file-datalake==12.17.0
azure-storage-file-share==12.18.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
chevron==0.14.0
click==8.1.7
cryptography==43.0.1
-e git+https://github.com/craddm/data-safe-haven.git@a6a6993ad7bbcc02e8f05629fe0fd5ab7154a900#egg=data_safe_haven
debugpy==1.8.6
dill==0.3.9
dnspython==2.6.1
fqdn==1.5.1
grpcio==1.60.2
idna==3.10
isodate==0.6.1
josepy==1.14.0
markdown-it-py==3.0.0
mdurl==0.1.2
msal==1.31.0
msal-extensions==1.2.0
msrest==0.7.1
oauthlib==3.2.2
parver==0.5
portalocker==2.10.1
protobuf==4.25.5
psycopg==3.2.3
pulumi==3.134.1
pulumi_azure_native==2.63.0
pulumi_random==4.16.6
pycparser==2.22
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
PyJWT==2.9.0
pyOpenSSL==24.2.1
pyRFC3339==1.1
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
requests-oauthlib==2.0.0
rich==13.8.1
semver==2.13.0
setuptools==75.1.0
shellingham==1.5.4
simple_acme_dns==3.1.0
six==1.16.0
typer==0.12.5
typing_extensions==4.12.2
urllib3==2.2.3
validators==0.28.3
websocket-client==1.8.0

🚫 Describe the problem

Virtual machine sizes with local storage (mounted at /mnt) clash with NFS mounts defined in /etc/fstab.

🌳 Log messages

Relevant log messages

image

♻️ To reproduce

Deploy a workspace using a VM size with local storage.

@craddm craddm added the bug Problem when deploying a Data Safe Haven. label Oct 3, 2024
@jemrobinson
Member

I saw the same problem when we first tested #2092. @JimMadge : can you take a look?

@craddm
Contributor Author

craddm commented Oct 3, 2024

Same missing setting you reported there: ldap is missing from group and passwd (image)

@JimMadge
Member

JimMadge commented Oct 3, 2024

Looks like the problem is that you already have a disk mounted at /mnt. What size are you using?

@jemrobinson
Member

jemrobinson commented Oct 3, 2024

@JimMadge : Is something automatically mounted at /mnt? Possibly the (badly documented) temp disk that lives on the same physical machine as the VM?

This might also explain why mounting at /shared was fine but /mnt/shared causes a problem.

@craddm
Contributor Author

craddm commented Oct 3, 2024

Standard_D2s_v3

btw I literally just got it running by creating the /mnt subdirectories from the console. Maybe adding the -m flag to mount -fav, so it creates the directories if they don't exist, will fix it?

@JimMadge
Member

JimMadge commented Oct 3, 2024

Hmm, I'm not sure. That size shouldn't have a temporary disk.

You can see in the fstab output though that there is a device /dev/disk/cloud/azure_resource-part1 at /mnt. That doesn't look right to me.
I haven't seen this problem in any deployments I've done recently.
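
One way to check this on a given VM is to list which fstab entries claim /mnt itself. A minimal sketch, assuming a hypothetical helper name `fstab_devices_at` and a standard whitespace-separated fstab layout; it only reads the file and says nothing about who wrote the entry:

```shell
# Hypothetical helper: print the device field of every fstab entry
# whose mount point (second field) is exactly $2.
# $1 = fstab file to inspect, $2 = mount point to look for
fstab_devices_at() {
    # Skip comment lines, then compare the second field to the requested mount point.
    awk -v mp="$2" '$1 !~ /^#/ && NF >= 2 && $2 == mp { print $1 }' "$1"
}
```

Running `fstab_devices_at /etc/fstab /mnt` on the affected VM should print the device that is occupying /mnt, if there is one.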

@jemrobinson
Member

jemrobinson commented Oct 3, 2024

Is it worth abandoning /mnt and putting our drives somewhere else?

Or alternatively, explicitly adding something like the following to /etc/fstab

/dev/disk/cloud/azure_resource-part1	/mnt/tmp	auto	defaults,nofail,_netdev	0	2

@JimMadge
Member

JimMadge commented Oct 3, 2024

I'd rather understand what is happening here. It isn't a mount that we have defined and I'm not sure what it is.

If it is something Azure is adding we could make sure to remove entries like that from fstab (if that is the problem here).

@jemrobinson
Member

jemrobinson commented Oct 3, 2024

The end of this section says "For Linux VMs, the temporary disk is /dev/sdb1 and is mounted at /mnt/resource or /mnt.", so I'm not sure we can disable this.

According to this outdated answer it might be possible to change this in waagent.conf. However, I'm not confident that we could use cloud-init to change that file, since waagent is used to run cloud-init.

Otherwise, we should see whether adding an explicit mount point for /dev/disk/cloud/azure_resource-part1 will fix it (as above).

@JimMadge
Member

JimMadge commented Oct 3, 2024

Oh wait, the Dvs3 series do have temporary disks.

Is there a reason to use such an old offering?

@jemrobinson
Member

I think we should be robust against the use of VM SKUs that have temporary disks (which is most of them) regardless of whether Dvs3 is a sensible SKU to use.

@craddm
Contributor Author

craddm commented Oct 3, 2024

It's just the one that was our default on the old codebase, so I still use it as a default out of habit, but as @jemrobinson says, we should be robust to stuff like this IMO. (I think the GPU VMs we currently recommend, e.g. Standard_NC6s_v3, also have local temp disks?)

@JimMadge
Member

JimMadge commented Oct 3, 2024

> (which is most of them)

I'm not sure if that is true, it tends to be the older offerings.
Looks like the high performance sizes and those with accelerators have local disks though, sometimes NVMe.

@JimMadge JimMadge added this to the Release 5.0.1 milestone Oct 3, 2024
@JimMadge JimMadge changed the title from "Mount point does not exist error on VMs" to "Path clash with Azure local storage" Oct 3, 2024
@jemrobinson
Member

From here

Most VMs contain a temporary disk, which is not a managed disk.

@JimMadge
Member

JimMadge commented Oct 4, 2024

I'm sure it has been true but the trend in the current and new general purpose sizes is to not include local storage and provide 'd' series variants for those that want it. Noting that line has been in the docs for quite a few years. That said, I don't think I want to enumerate all the available sizes or sizes*availability 😅.

It does seem common on sizes with accelerators though. That makes sense as the users would likely want fast storage, and the physical nodes in the data centre are more likely to have onboard storage.

We'll need to fix this to enable GPU/FPGA sizes.

@craddm
Contributor Author

craddm commented Oct 4, 2024

Adding -o X-mount.mkdir fixed this on the Standard_D2s_v3 at least, although I'm still unable to log in

e.g. while (! mountpoint -q /mnt/input); do sleep 5; mount -o X-mount.mkdir /mnt/input; done
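
The loop above retries forever if the mount can never succeed. A minimal sketch of a bounded variant, assuming a hypothetical helper name `mount_with_retry` and illustrative defaults (12 attempts, 5 seconds apart) that are not taken from the codebase; it relies on the mount point having an entry in /etc/fstab, as in the one-liner:

```shell
# Hypothetical helper: retry mounting an fstab-defined mount point, creating the
# directory if needed, but give up after a fixed number of attempts.
# $1 = mount point, $2 = max attempts (default 12), $3 = seconds between attempts (default 5)
mount_with_retry() {
    target="$1"
    max_attempts="${2:-12}"
    delay="${3:-5}"
    attempt=0
    while ! mountpoint -q "$target"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up on $target after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # X-mount.mkdir asks mount(8) to create the target directory if it is missing.
        mount -o X-mount.mkdir "$target" || sleep "$delay"
    done
}
```

For example, `mount_with_retry /mnt/input` would try for roughly a minute and then return a non-zero status instead of blocking the rest of the boot sequence.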

@JimMadge
Member

JimMadge commented Oct 4, 2024

What does that option do?

I think I'd rather just not mount local storage at /mnt so that we are consistent across machines with and without local storage.

@craddm
Contributor Author

craddm commented Oct 4, 2024

It creates the directory if it doesn't exist

@jemrobinson
Member

Alias --mkdir might be more clear?

@craddm
Contributor Author

craddm commented Oct 4, 2024

> Alias --mkdir might be more clear?

it would be, but for some reason that's not a supported option on the machine I tested

@jemrobinson
Member

This doesn't really deal with the question of what happens in these scenarios:

  • temp disk mounted at /mnt
  • data mounted at /mnt/ingress
  • temp disk unmounted

OR

  • data mounted at /mnt/ingress
  • temp disk mounted at /mnt

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.
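
To see which mounts would be affected in the scenarios above, something like the following could list every mount point that sits underneath /mnt. A minimal sketch, assuming a hypothetical helper name `mounts_below` and a /proc/mounts-style table (mount point in the second field); it does not work out the order in which the mounts happened:

```shell
# Hypothetical helper: list mount points that live underneath a parent mount point.
# $1 = mounts table (e.g. /proc/mounts), $2 = parent mount point (e.g. /mnt)
mounts_below() {
    # A mount point is "below" the parent if it starts with "<parent>/".
    awk -v parent="$2" 'index($2, parent "/") == 1 { print $2 }' "$1"
}
```

On a workspace, `mounts_below /proc/mounts /mnt` should show the data mounts that depend on whatever is mounted at /mnt.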

@JimMadge
Member

JimMadge commented Oct 4, 2024

I think the best solution would be that we standardise where our mounts and temp disk(s) are in all cases.

I like having our mounts at /mnt, it feels idiomatic. I would put local disks at something like /scratch, /var/scratch, /mnt/scratch.

We might need some logic like "if /dev/disk/cloud/... then ..."
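
A rough sketch of what that conditional could look like, assuming a hypothetical helper name `local_disk_fstab_line`; the device path is passed in by the caller, and the mount options are simply copied from the fstab line suggested earlier in this thread. Whether such a line would win over the entry Azure adds is not something this shows:

```shell
# Hypothetical helper: emit an fstab line for the local disk only when the device exists.
# $1 = device path, $2 = mount point for the local disk (e.g. /mnt/scratch)
local_disk_fstab_line() {
    device="$1"
    mount_point="$2"
    # VM sizes without local storage have no such device, so print nothing for them.
    [ -e "$device" ] || return 0
    printf '%s\t%s\tauto\tdefaults,nofail,_netdev\t0\t2\n' "$device" "$mount_point"
}
```

Called as `local_disk_fstab_line /dev/disk/cloud/azure_resource-part1 /mnt/scratch`, it would print a line on sizes with a temp disk and stay silent otherwise, so the same provisioning script could run on both.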

@jemrobinson
Member

jemrobinson commented Oct 4, 2024

Agreed that /var/scratch or /mnt/scratch makes sense for the temp disk. Have you tried the fstab line I put above?

@craddm
Contributor Author

craddm commented Oct 4, 2024

> This doesn't really deal with the question of what happens in these scenarios:
>
>   • temp disk mounted at /mnt
>   • data mounted at /mnt/ingress
>   • temp disk unmounted
>
> OR
>
>   • data mounted at /mnt/ingress
>   • temp disk mounted at /mnt
>
> In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.

Looks like it's possible to change the location of the temp disk in /etc/waagent.conf:

There's a field ResourceDisk.MountPoint which is set to /mnt on our VMs

https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/agent-linux
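
For experimenting with that field, a minimal sketch that rewrites it in place, assuming a hypothetical helper name `set_resource_disk_mount_point` and that the file uses plain `Key=Value` lines. It is best tried on a copy first; the agent would presumably need restarting to pick up the change, and on images where cloud-init handles the temp disk the setting may have no effect:

```shell
# Hypothetical helper: set ResourceDisk.MountPoint in a waagent.conf-style file.
# $1 = config file (try on a copy first), $2 = new mount point
set_resource_disk_mount_point() {
    conf="$1"
    new_mp="$2"
    tmp="$(mktemp)" || return 1
    # Rewrite only the ResourceDisk.MountPoint line; pass every other line through unchanged.
    awk -v mp="$new_mp" '
        /^ResourceDisk\.MountPoint=/ { print "ResourceDisk.MountPoint=" mp; next }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `set_resource_disk_mount_point ./waagent.conf.copy /mnt/scratch` changes the one line and leaves the file's permissions as they were.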

@jemrobinson
Member

We might be going in circles here, but since waagent is used to run cloud-init, we probably can't use a cloud-init command to change the configuration.

@craddm craddm self-assigned this Oct 4, 2024
@JimMadge
Member

JimMadge commented Oct 4, 2024

Can we give arguments to waagent when deploying the machine?

@craddm
Contributor Author

craddm commented Oct 4, 2024

So, we can add ephemeral0 to the mounts: section of our cloud-init (see here).

[ephemeral0, null] effectively doesn't mount it
[ephemeral0, /mnt/resource] mounts it to /mnt/resource. So we could mount it to /mnt/scratch or whatever you'd prefer. Works whether the VM really has an ephemeral disk or not - just an empty folder when there is no real extra disk.
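
As a sketch of what that fragment could look like once a location is agreed, here is a small shell function that prints it. The function name `ephemeral_mounts_fragment` is hypothetical, and the /mnt/scratch default is just one of the names floated above, not a decision:

```shell
# Hypothetical helper: print a cloud-init "mounts" fragment that places the
# ephemeral disk at $1 (default /mnt/scratch, one of the names suggested in this thread).
ephemeral_mounts_fragment() {
    target="${1:-/mnt/scratch}"
    cat <<EOF
mounts:
  - [ephemeral0, $target]
EOF
}
```

Passing a different path, e.g. `ephemeral_mounts_fragment /var/scratch`, gives the same fragment with that target, which makes it easy to compare the options being discussed.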

At this point the desktop icons for input, output and shared are not working
