Issue with AlmaLinux cluster on Azure due to DHCP problem #338

Open

ocaisa opened this issue Nov 22, 2024 · 6 comments

ocaisa (Collaborator) commented Nov 22, 2024

I tried to create a new cluster on Azure today using the same configuration I used yesterday, but although Puppet configures successfully, there seems to be a problem connecting to the cluster. I think it is a network issue:

Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278248.9666] device (eth1): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <warn>  [1732278248.9672] device (eth1): Activation: failed for connection 'Wired connection 1'
Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278248.9674] device (eth1): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278248.9998] dhcp4 (eth1): canceled DHCP transaction
Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278248.9998] dhcp4 (eth1): activation: beginning transaction (timeout in 300 seconds)
Nov 22 12:24:08 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278248.9999] dhcp4 (eth1): state changed no lease
Nov 22 12:24:09 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278249.0013] device (eth1): Activation: starting connection 'Wired connection 1' (141332cd-12b8-3bcc-9242-b6af1fbcdb71)
Nov 22 12:24:09 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278249.0014] device (eth1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Nov 22 12:24:09 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278249.0016] device (eth1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Nov 22 12:24:09 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278249.0021] device (eth1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Nov 22 12:24:09 aarch64-neoverse-n1-node1 NetworkManager[2446]: <info>  [1732278249.0023] dhcp4 (eth1): activation: beginning transaction (timeout in 300 seconds)

and I notice there is a very recent, similar issue reported upstream by Red Hat: https://access.redhat.com/solutions/7002283
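
For anyone else debugging this, a few commands that should show the DHCP state on the affected interface (assuming NetworkManager is managing eth1, as in the log above):

# Show device state and any lease that was eventually obtained
nmcli device show eth1

# Follow NetworkManager's DHCP transactions for eth1
journalctl -u NetworkManager --no-pager | grep 'dhcp4 (eth1)'

# Manually retry the connection named in the log
nmcli connection up 'Wired connection 1'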

ocaisa commented Nov 22, 2024

Ah, no, seems like the error is elsewhere:

-agent[1137]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns)   × No interpreter found in managed installations or system path
-agent[1137]: 'uv venv --seed --python-preference system /opt/ipython-kernel-eessi' returned 1 instead of one of [0]
-agent[1137]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns) change from 'notrun' to ['0'] failed: 'uv venv --seed --python-preference system /opt/ipyth>
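
That command comes from the Puppet-managed Jupyter kernel venv. To reproduce it by hand and see which interpreters uv can actually discover, something like this should work (using the /opt/uv/bin install path from a later comment):

# List the Python interpreters uv can find (managed and system)
/opt/uv/bin/uv python list

# Re-run the exact command Puppet runs, with verbose output
/opt/uv/bin/uv -v venv --seed --python-preference system /opt/ipython-kernel-eessi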

ocaisa commented Nov 22, 2024

There was a release of uv two days ago (and I may have created my node image before it was pulled in). Could that be it?

ocaisa commented Nov 22, 2024

Hmm, that is also not it:

[centos@aarch64-neoverse-n1-node1 uv]$ /opt/uv/bin/uv --version
uv 0.4.22
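
As a cross-check, pointing uv at an explicit interpreter should bypass discovery entirely; if this succeeds, the problem is interpreter discovery rather than uv itself (the /usr/bin/python3 path is an assumption, adjust to wherever the system Python actually lives):

# Bypass uv's interpreter discovery with an explicit path
/opt/uv/bin/uv venv --seed --python /usr/bin/python3 /opt/ipython-kernel-eessi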

ocaisa commented Nov 22, 2024

-agent[10516]: Requesting catalog from mgmt1:8140 (10.0.1.6)
-agent[10516]: Catalog compiled by mgmt1
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/Exec[set_hostname]/returns) executed successfully (corrective)
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns)   × No interpreter found in managed installations or system path
-agent[10516]: 'uv venv --seed --python-preference system /opt/ipython-kernel-eessi' returned 1 instead of one of [0]
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns) change from 'notrun' to ['0'] failed: 'uv venv --seed --python-preference system /opt/ipyt>
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[pip_ipykernel]) Dependency Exec[kernel_venv] has failures: true
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[pip_ipykernel]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc/ipython]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc/ipython/ipython_config.py]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/Exec[ipa-install]/returns) [output redacted]
-agent[10516]: [command redacted] returned 2 instead of one of [0]
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/Exec[ipa-install]/returns) change from 'notrun' to ['0'] failed: [command redacted] returned 2 instead of one of [0] >
-agent[10516]: (/Stage[main]/Profile::Freeipa::Base/Service[systemd-logind]) Dependency Exec[ipa-install] has failures: true
-agent[10516]: (/Stage[main]/Profile::Freeipa::Base/Service[systemd-logind]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/File_line[ssh_known_hosts]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/Exec[selinux_login_default]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Freeipa::Client/Augeas[selinux_provider]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Sssd::Client/Service[sssd]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Skipping because of failed dependencies
-agent[10516]: (/Stage[main]/Profile::Slurm::Node/Exec[systemctl restart slurmd]) Skipping because of failed dependencies
-agent[10516]: Applied catalog in 127.82 seconds

ocaisa commented Nov 22, 2024

The Puppet log from a fresh boot looks like this:

Nov 22 13:24:09 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Software_stack/Consul::Service[software_stack]/File[/etc/consul/service_software_stack.json]/ensure) defined content as '{sha256}ef1d396432872167ace5bd69f9f2d7ae78591bcfa47eced3c5e2c44424cbf586'
Nov 22 13:24:09 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Consul::Reload_service/Exec[reload consul service]) Triggered 'refresh' from 1 event
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns)   × No interpreter found in managed installations or system path
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: 'uv venv --seed --python-preference system /opt/ipython-kernel-eessi' returned 1 instead of one of [0]
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[kernel_venv]/returns) change from 'notrun' to ['0'] failed: 'uv venv --seed --python-preference system /opt/ipython-kernel-eessi' returned 1 instead of one of [0]
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[pip_ipykernel]) Dependency Exec[kernel_venv] has failures: true
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/Exec[pip_ipykernel]) Skipping because of failed dependencies
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc]) Skipping because of failed dependencies
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc/ipython]) Skipping because of failed dependencies
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/ipython-kernel-eessi/etc/ipython/ipython_config.py]) Skipping because of failed dependencies
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/jupyterhub/share/jupyter/kernels/python3/kernel.json]/content) content changed '{sha256}ebdcb9aaca71bac28a49b29fa58c3ce82e3f505574d4053a36d46e522f6dfa42' to '{sha256}3e9a21389aa7cb2ee22c35900e7927a41eaa1cd9af29b422aede2d3cdcd20a33'
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Jupyterhub::Kernel::Venv/File[/opt/jupyterhub/share/jupyter/kernels/python3/kernel.json]/seltype) seltype changed 'cache_home_t' to 'usr_t'
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Users::Local/Profile::Users::Local_user[centos]/Selinux::Exec_restorecon[/centos]/Exec[selinux::exec_restorecon /centos]) Triggered 'refresh' from 1 event
Nov 22 13:24:10 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Wait_for[ipa_records]) Triggered 'refresh' from 1 event
Nov 22 13:24:40 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Wait_for[ipa-ca_https]) Triggered 'refresh' from 1 event
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Exec[ipa-install]/returns) [output redacted]
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: [command redacted] returned 2 instead of one of [0]
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Exec[ipa-install]/returns) change from 'notrun' to ['0'] failed: [command redacted] returned 2 instead of one of [0]
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Base/Service[systemd-logind]) Dependency Exec[ipa-install] has failures: true
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Base/Service[systemd-logind]) Skipping because of failed dependencies
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/File_line[ssh_known_hosts]) Skipping because of failed dependencies
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Exec[selinux_login_default]) Skipping because of failed dependencies
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Freeipa::Client/Augeas[selinux_provider]) Skipping because of failed dependencies
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Sssd::Client/Service[sssd]) Skipping because of failed dependencies
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Prometheus::Node_exporter/Prometheus::Daemon[node_exporter]/Systemd::Unit_file[node_exporter.service]/File[/etc/systemd/system/node_exporter.service]/ensure) defined content as '{sha256}7278874c4ecc8a47d23363bc3370888e0dd959987902309d16ea142ab78045f6'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Logrotate::Config/Logrotate::Cron[daily]/File[/etc/cron.daily/logrotate]/ensure) defined content as '{sha256}5e9cb8fcc8653356515b6c6f7f3d026f5f6b124faa624f870894606108b5ed7b'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Slurm::Node/Logrotate::Rule[slurmd]/File[/etc/logrotate.d/slurmd]/ensure) defined content as '{sha256}fbde68107fc44f45b839cee091f06c3ea7f9a30c904aa846e402f9932e269bcd'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Logrotate::Defaults/Logrotate::Conf[/etc/logrotate.conf]/File[/etc/logrotate.conf]/content) content changed '{sha256}96fe9ec9ad3f0cee5bfc6de806fda99d7f9e519b738808590824c775c69532eb' to '{sha256}3b0d0c652e1fda2912137ec906ecf256ac75aafa8013aa27ff08a9869a1cbff6'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Logrotate::Defaults/Logrotate::Rule[wtmp]/File[/etc/logrotate.d/wtmp]/content) content changed '{sha256}81eec3d3e01e4263ffc6aab6aeee74f3b260f1df4c692525312608f7a60db048' to '{sha256}2999a3734c4d7501b4041b36b26447bce9ec7310987c8b4e5ded0ac06fe5c8e5'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Logrotate::Defaults/Logrotate::Rule[btmp]/File[/etc/logrotate.d/btmp]/content) content changed '{sha256}7f64bae051c1727236f9dabe11526f1d1aad26aa53c1d0b8425efee746f5fbe8' to '{sha256}ac29498efdf341b601aa0e436bdf7ab14ca8e6122a62d81070cc1455cc1a0b82'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/home]/Mount[shared /home by 10.0.1.6 on /home]/ensure) defined 'ensure' as 'defined'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/home]/Mount[shared /home by 10.0.1.6 on /home]) Triggered 'refresh' from 1 event
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/project]/Nfs::Functions::Mkdir[/project]/Exec[mkdir_recurse_/project]/returns) executed successfully
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/project]/Mount[shared /project by 10.0.1.6 on /project]/ensure) defined 'ensure' as 'defined'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/project]/Mount[shared /project by 10.0.1.6 on /project]) Triggered 'refresh' from 1 event
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/scratch]/Nfs::Functions::Mkdir[/scratch]/Exec[mkdir_recurse_/scratch]/returns) executed successfully
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/scratch]/Mount[shared /scratch by 10.0.1.6 on /scratch]/ensure) defined 'ensure' as 'defined'
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Nfs::Client::Mount[/scratch]/Mount[shared /scratch by 10.0.1.6 on /scratch]) Triggered 'refresh' from 1 event
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Systemd::Daemon_reload[nfs-automount]/Exec[systemd-nfs-automount-systemctl-daemon-reload]) Triggered 'refresh' from 1 event
Nov 22 13:26:43 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Nfs::Client/Exec[systemctl restart remote-fs.target]) Triggered 'refresh' from 1 event
Nov 22 13:26:44 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Consul_template::Service/Service[consul-template]/ensure) ensure changed 'stopped' to 'running'
Nov 22 13:26:44 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Slurm::Base/Wait_for[slurmctldhost_set]) Triggered 'refresh' from 1 event
Nov 22 13:26:44 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Prometheus::Node_exporter/Prometheus::Daemon[node_exporter]/Systemd::Unit_file[node_exporter.service]/Systemd::Daemon_reload[node_exporter.service]/Exec[systemd-node_exporter.service-systemctl-daemon-reload]) Triggered 'refresh' from 1 event
Nov 22 13:26:45 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Prometheus::Node_exporter/Prometheus::Daemon[node_exporter]/Service[node_exporter]/ensure) ensure changed 'stopped' to 'running'
Nov 22 13:26:45 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Slurm::Node/Service[slurmd]) Skipping because of failed dependencies
Nov 22 13:26:45 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: (/Stage[main]/Profile::Slurm::Node/Exec[systemctl restart slurmd]) Skipping because of failed dependencies
Nov 22 13:26:45 aarch64-neoverse-n1-node1.int.azure-alma94-2411-mc-14.dev.eessi puppet-agent[1160]: Applied catalog in 330.21 seconds
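
The ipa-install failure could be a knock-on effect of the same underlying problem. Since the exact ipa-client-install invocation is redacted above, the following is only a generic sanity check from the node (mgmt1 is the management host from the log):

# Can the node resolve and reach the FreeIPA server?
getent hosts mgmt1
curl -sk -o /dev/null -w '%{http_code}\n' https://mgmt1/ipa/ui/

# The client-side install log usually has the real error
sudo tail -n 50 /var/log/ipaclient-install.log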

ocaisa commented Nov 22, 2024

Interestingly, the problem only seems to occur on the Arm node; the Zen4 node came up just fine.
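
If uv's interpreter discovery is the culprit, comparing what it sees on the two architectures might narrow this down (same paths as above; uv python find reports the interpreter uv would select):

# Run on both the Arm and Zen4 nodes and compare
uname -m
ls /usr/bin/python*
/opt/uv/bin/uv python find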
