I have one control plane node and one worker node. How am I supposed to safely reboot the worker node? Is it safe to run `kubectl drain <node-name> --ignore-daemonsets --delete-local-data`? I feared that the OSD might not be restored if local data on the worker is erased. Not knowing, I decided to just reboot the worker node, but it ended up being more of a forced power off.
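For context, the sequence I had in mind, with `<node-name>` as a placeholder for my single worker:

```sh
# Cordon the node and evict everything except DaemonSet pods;
# emptyDir ("local") data on the node is deleted because of the flag.
kubectl drain <node-name> --ignore-daemonsets --delete-local-data

# ...reboot the worker itself, e.g. `sudo reboot` on the node...

# Allow pods to be scheduled on the node again.
kubectl uncordon <node-name>
```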
Currently, it looks like the OSD is being detected by `rook-ceph-osd-prepare`. But `rook-ceph-osd` can't start: it looks like `rook-ceph-osd` is trying to open the wrong device (`sde`), similar to https://github.com/rook/rook/pull/11567/files. The device on the host is `sda`. Is it possible to recover from this?

Is `drain` harmless even if you have a single worker node? If yes, I could run it to see if it fixes the issue. Otherwise, I may have found a bug that happens after a "power failure" of a single-node cluster.

Here is the relevant output of the activate container:

`kubectl -n rook-ceph logs rook-ceph-osd-0-<id> -c activate`
**Update**

I decided to drain the node, reboot, and uncordon. It seems this didn't make things worse, but the outputs/errors remain the same.
I also tried pointing `/var/lib/rook/rook-ceph/<id>/block` to `/dev/sda`, and switching `values.yaml` to a per-node filter that only uses `sda` instead of `useAllDevices` and `useAllNodes` (sketched below).
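Roughly, a minimal sketch of that filter, assuming the `rook-ceph-cluster` Helm chart layout where the CephCluster spec sits under `cephClusterSpec` (the node name below is a placeholder):

```yaml
cephClusterSpec:
  storage:
    # Stop consuming every node and every device automatically.
    useAllNodes: false
    useAllDevices: false
    nodes:
      # Placeholder: replace with the worker's Kubernetes node name.
      - name: worker-1
        devices:
          # Only consider /dev/sda for OSDs on this node.
          - name: "sda"
```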
Issue may be related to #13564.