
Upgrade storage for mainnet fleet #184

Closed · jakubgs opened this issue May 13, 2024 · 9 comments

jakubgs commented May 13, 2024

It's about time we increase the storage available for both Docker containers (Geth) and systemd services (Beacon Nodes):

[email protected]:~ % df -h / /docker /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       366G   62G  286G  18% /
/dev/sdc        1.5T  1.2T  174G  88% /docker
/dev/sdb        1.5T  1.2T  259G  82% /data

The current layout uses a single logical volume per physical volume (SSD) configured in the controller.

The migration to RAID0 logical volumes spanning two SSDs using the HPE Smart Array utility is documented here:
https://docs.infra.status.im/general/hp_smart_array_raid.html
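For reference, combining two physical drives into a single RAID0 logical drive with ssacli looks roughly like this (controller slot and drive IDs are examples, check them first with the show commands):

sudo ssacli ctrl slot=0 pd all show status                            # list physical drives and their port:box:bay IDs
sudo ssacli ctrl slot=0 create type=ld raid=0 drives=1I:2:1,1I:2:2    # create one RAID0 logical drive from two SSDs
sudo ssacli ctrl slot=0 ld all show status                            # verify the new logical drive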

The steps for the migration of each host will look like this:

  1. Request attachment of a temporary 1.5 TB (or bigger) SSD on the host for the migration.
  2. Migrate /data files to the temporary migration SSD.
  3. Destroy the /data logical volume and re-create it with two physical volumes (SSDs) as one RAID0 logical volume.
  4. Migrate from the temporary SSD back to the new RAID0 /data volume.
  5. Repeat steps 2, 3, & 4 for the /docker volume.
  6. Inform support they can move the migration SSD to another host, and repeat for that host.

I would recommend creating a single support ticket to order 2 extra SSDs of the same type for all nimbus.mainnet hosts, and then managing the migration of each host in the comments of that ticket.
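As a rough per-volume sketch of steps 2-4 (the mount points, logical drive number, and drive IDs below are placeholders; the exact commands run on each host are recorded in the comments that follow):

sudo systemctl stop beacon-node-mainnet-*                             # stop the services using the volume
sudo rsync -Pa /data/ /mnt/migration/                                 # step 2: copy to the temporary migration SSD
sudo ssacli ctrl slot=0 ld 2 delete                                   # step 3: drop the old single-drive logical volume...
sudo ssacli ctrl slot=0 create type=ld raid=0 drives=1I:2:1,1I:2:2    # ...and re-create it as RAID0 over two SSDs
# re-create the filesystem and mount (e.g. via the role::bootstrap:volumes tag) before copying back
sudo rsync -Pa /mnt/migration/ /data/                                 # step 4: copy the data back
sudo systemctl start beacon-node-mainnet-*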

jakubgs commented May 13, 2024

You can find more examples of me using ssacli to configure volumes here:

Just need to click the Load more... button.

yakimant commented May 13, 2024

Ticket Created #351756

yakimant commented:

We are able to connect today only 9 SSD disks of 1.6TB capacity (4 servers). If you need all with 1.6TB capacity, then it will be possible to connect the remaining in the next 2 weeks or if you want, we may connect a 3.84TB drive on each of the remaining servers.

The cost for the single additional disk is 20 euro per 1.6TB SSD drive.

Asked about the 4 TB price, and about 3 TB (if they have them).

yakimant commented:

The 4 TB disk will cost twice that; will go ahead with it.

Pros:

  • No RAID0 needed for that disk
  • No need for temporary drive

Cons:

  • RAID0 would provide better speed, but we don't need it

yakimant commented:

ssacli installation:

echo "deb http://downloads.linux.hpe.com/SDR/repo/mcp jammy/current non-free" | sudo tee /etc/apt/sources.list.d/hp-mcp.list
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey1024.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
sudo apt update
sudo apt install ssacli
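A quick sanity check after installation (assuming the controller sits in slot 0, as on most of these hosts):

dpkg -s ssacli                       # confirm the package is installed
sudo ssacli ctrl all show status     # list detected Smart Array controllers and their status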

yakimant commented:

Disks are installed:

❯ ansible nimbus-mainnet-metal -i ansible/inventory/test -a 'sudo ssacli ctrl slot=0 pd allunassigned show'
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS SSD, 3.8 TB, OK)
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 3.8 TB, OK)
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS SSD, 3.8 TB, OK)
linux-04.ih-eu-mda1.nimbus.mainnet | FAILED | rc=1 >>

Error: The controller identified by "slot=0" was not detected.non-zero return code
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS SSD, 3.8 TB, OK)
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 3.8 TB, OK)
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS SSD, 3.8 TB, OK)

linux-04 has a different slot:

❯ sudo ssacli ctrl slot=1 pd allunassigned show

Smart Array P222 in Slot 1

   Unassigned

      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS SSD, 3.8 TB, OK)
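If it's unclear which slot a controller occupies, listing all controllers first saves guessing (this is the generic ssacli invocation, not something from the migration log):

sudo ssacli ctrl all show            # prints every Smart Array controller together with its slot number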

yakimant commented:

IH had an issue with the disks. They fixed it on linux-01 and I was able to set them up.

This was done with approximately these actions:

sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE (change here, e.g. 2I:2:6)
sudo ssacli ctrl slot=0 ld all show status
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
docker-compose -f docker-compose.exporter.yml -f docker-compose.yml stop
sudo systemctl stop syslog
sudo rsync -Pa /mnt/sdc/geth-mainnet /mnt/sdd/geth-mainnet
[ansible/group_vars/ih-eu-mda1.yml] change /docker
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
docker-compose -f docker-compose.exporter.yml -f docker-compose.yml start
[grafana] check geth graphs - syncing
sudo systemctl stop beacon-node-mainnet-*
sudo rsync -Pa sdb/beacon-node-mainnet-* sdb/era sdd/sdb/
sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 ld 2 delete
sudo ssacli ctrl slot=0 ld 3 delete
sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE1,DRIVE2 raid=0 # (eg 1I:2:1,1I:2:2)
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
sudo systemctl start beacon-node-mainnet-*
[grafana] check nimbus graphs - syncing

Need to fine-tune a bit:

  • rsync is slow, maybe something faster could be used (see the sketch below)
  • better to stop, move, and set up the disks in one go, not geth/nimbus separately
  • I should run umount before running ansible
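One untested option for the bulk copy, since the destination starts out empty, is a straight tar pipe instead of rsync (paths are examples):

sudo tar -C /docker -cf - geth-mainnet | sudo tar -C /mnt/sdd -xf -    # stream the directory without rsync's per-file checks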

yakimant commented May 29, 2024

BTW, if docs or support mention using a different tool, here is the timeline:

  1. hpacucli (versions 9.10 - 9.40, 2012 - 2014, probably before too)
  2. hpssacli (versions 2.0 - 2.40, 2014 - 2016)
  3. ssacli (versions 3.10 - 6.30, 2017 - now)

yakimant commented:

Done, disks are set up with these commands:

[local] ansible linux-05.ih-eu-mda1.nimbus.mainnet,linux-06.ih-eu-mda1.nimbus.mainnet,linux-07.ih-eu-mda1.nimbus.mainnet -a 'sudo systemctl stop consul'
sudo ssacli ctrl slot=0 pd all show status; sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE (change here, e.g. 2I:2:6)
sudo ssacli ctrl slot=0 ld all show status
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=HOSTNAME -Dv -t role::bootstrap:volumes
docker-compose -f /docker/geth-mainnet/docker-compose.exporter.yml -f /docker/geth-mainnet/docker-compose.yml stop
sudo systemctl stop syslog beacon-node-mainnet-*
sudo rsync --stats -hPa --info=progress2,name0 /docker/geth-mainnet /docker/log /data/beacon* /data/era /mnt/sdd/
sudo umount /mnt/sdb /data /docker /mnt/sdc
sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 ld 2 delete
sudo ssacli ctrl slot=0 ld 3 delete
sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld raid=0 drives=DRIVE1,DRIVE2  # (eg 1I:2:1,1I:2:2)
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=HOSTNAME -Dv -t role::bootstrap:volumes
sudo rsync --stats -hPa --info=progress2,name0 /docker/beacon* /docker/era /data/
docker-compose -f /docker/geth-mainnet/docker-compose.exporter.yml -f /docker/geth-mainnet/docker-compose.yml start
sudo systemctl start beacon-node-mainnet-* syslog
[grafana] check geth graphs - syncing
[grafana] check nimbus graphs - syncing

Something is missing before the second ansible-playbook run, causing these issues:

  • Error mounting /mnt/sdb: mount: /mnt/sdb: mount(2) system call failed: Structure needs cleaning.
    • sudo mkfs.ext4 /dev/sdb -L DATA_VOLUME1
  • tune2fs: Journal must be at least 1024 blocks while recovering journal.
    • sudo e2label /dev/sdb DATA_VOLUME1

I didn't research much and just ran the commands above.
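Probably (not verified) the re-created logical drives still carried stale ext4 signatures from the old single-drive volumes, so the playbook tried to mount and tune a broken filesystem. Wiping the signatures before the second ansible-playbook run might avoid it next time (destructive, only on the freshly re-created drives):

sudo wipefs -a /dev/sdb /dev/sdc     # clear leftover filesystem signatures before the playbook re-creates the filesystems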

One finding: linux-02.ih-eu-mda1.nimbus.mainnet has some weird disk attached, which the others don't:

❯ lsblk /dev/sdd

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdd      8:48   0  256M  1 disk
`-sdd1   8:49   0  251M  1 part

❯ sudo fdisk -l /dev/sdd

Disk /dev/sdd: 256 MiB, 268435456 bytes, 524288 sectors
Disk model: LUN 00 Media 0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000046

Device     Boot Start    End Sectors  Size Id Type
/dev/sdd1          63 514079  514017  251M  c W95 FAT32 (LBA)

❯ ls -l /dev/disk/by-id /dev/disk/by-path/ | grep sdd
lrwxrwxrwx 1 root root  9 Feb 22 19:05 usb-HP_iLO_LUN_00_Media_0_000002660A01-0:0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 22 19:05 usb-HP_iLO_LUN_00_Media_0_000002660A01-0:0-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Feb 22 19:05 pci-0000:00:1d.0-usb-0:1.3.1:1.0-scsi-0:0:0:0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 22 19:05 pci-0000:00:1d.0-usb-0:1.3.1:1.0-scsi-0:0:0:0-part1 -> ../../sdd1

Looks like it's some 256 MB USB drive.

ChatGPT says it may have something to do with HP iLO (Integrated Lights-Out) and its LUN (Logical Unit Number). I have no idea what that is.
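If we want to confirm it's the iLO virtual media device, something like this should show it (assumption: it appears as an HP USB mass-storage device):

lsusb | grep -iE 'hewlett|hp'        # the iLO virtual media LUN should show up as an HP USB device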
