We install the qemu-guest-agent package in ensure_packages_installed().
Therefore, try to start the qemu-guest-agent service only afterwards.
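A minimal sketch of the intended ordering, assuming the function and service names quoted above (the surrounding error handling in the script may differ):

  # install qemu-guest-agent (and friends) first ...
  ensure_packages_installed
  # ... and only then try to start the service
  systemctl start qemu-guest-agent || echo "Failed to start qemu-guest-agent, continuing anyway" >&2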
Fixup for commit 82e6638b40
Change-Id: Ic4aa2e493851b4c92ac134d68a9a76e05485658d
(cherry picked from commit cf94193f88)
(cherry picked from commit 6d3c733314)
/dev/virtio-ports/org.qemu.guest_agent.0 usually is a symlink to the
character device /dev/vport1p1. So adjust the device check accordingly
and only verify that it exists, but don't expect any particular file type.
This actually matches the behavior we also have in ngcp-installer.
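A hedged sketch of the relaxed check (the variable name is illustrative; the actual test in the script may look different):

  GUEST_AGENT_DEV='/dev/virtio-ports/org.qemu.guest_agent.0'
  # only verify existence; the path usually is a symlink to e.g. /dev/vport1p1,
  # so don't insist on a specific file type like character device or socket
  if [ -e "${GUEST_AGENT_DEV}" ] ; then
    systemctl start qemu-guest-agent
  fi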
Fixup for commit 82e6638b40
Change-Id: I0aa93c1f0e1086847eb7ed6967692a52e183bdc3
(cherry picked from commit 4a292ab4be)
(cherry picked from commit 2e674fe092)
Now that we enabled the QEMU Guest Agent option for our PVE VMs, we need
to have qemu-guest-agent present and active. Otherwise the VMs might
fail to shut down, like with our debian/sipwise/docker Debian systems
which are created via
https://jenkins.mgm.sipwise.com/job/daily-build-matrix-debian-boxes/:
| [proxmox-vm-shutdown] $ /bin/sh -e /tmp/env-proxmox-vm-shutdown7956268380939677154.sh
| [environment-script] Adding variable 'vm1reset' with value 'NO'
| [environment-script] Adding variable 'vm2' with value 'none'
| [environment-script] Adding variable 'vm1' with value 'none'
| [environment-script] Adding variable 'vm2reset' with value 'NO'
| [proxmox-vm-shutdown] $ /bin/bash /tmp/jenkins14192704603218787414.sh
| Using safe VM 'shutdown' for modern releases (mr6.5+). Executing action 'shutdown'...
| Shutting down VM 106
| Build timed out (after 10 minutes). Marking the build as aborted.
| Build was aborted
| [WS-CLEANUP] Deleting project workspace...
Let's make sure qemu-guest-agent is available in our Grml live system.
We added qemu-guest-agent to the package list of our Grml Sipwise ISO
(see git rev 65c3fea4c), but to ensure we don't strictly depend on this
brand new Grml Sipwise ISO yet, make sure to install it on-the-fly if
not yet present (like we already did for git, augeas-tools + gdisk).
Also make sure the qemu-guest-agent service is enabled if the socket
/dev/virtio-ports/org.qemu.guest_agent.0 is present (indicating that the
agent feature is enabled on VM level).
Furthermore, ensure qemu-guest-agent is present also in the installed
Debian system. Otherwise, when rebooting the VM once it's no longer
running the Grml live system but the installed Debian system, it might
also fail to shut down. So add it to the default list of packages for
bootstrapping.
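A rough sketch of the idea, assuming apt-get based on-the-fly installation as we already do for git, augeas-tools + gdisk (the exact commands in the script may differ):

  # install qemu-guest-agent on-the-fly if the ISO doesn't ship it yet
  if ! dpkg -l qemu-guest-agent 2>/dev/null | grep -q '^ii' ; then
    apt-get -y install qemu-guest-agent
  fi
  # enable the service only if the agent device is present (i.e. the QEMU
  # Guest Agent option is enabled on VM level)
  if [ -e /dev/virtio-ports/org.qemu.guest_agent.0 ] ; then
    systemctl enable qemu-guest-agent
  fi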
Change-Id: Id6adac55a47cfaed542cad2f9ac9740783e6d924
(cherry picked from commit 82e6638b40)
(cherry picked from commit b00792606c)
With this variable we had some special handling in
ngcp-initial-configuration when the Pro sp2 node is installed via
iPXE/cm image. Now we support installation of sp2 via iPXE only, so
there is no need to pass this variable.
But we need to keep the parent ngcppxeinstall parameter, as we need
this information for netcardconfig.
Change-Id: I20491289917cbb427ad6f5670f108c632838be71
We are dropping the scenario where the sp2 node is installed from a CD
image, so remove the corresponding part of the code.
Change-Id: Idced6b43a21add903dca070aa68f84b77acba28e
The code trying to fetch the OpenPGP certificate from a keyserver has
been non-functional for a while as the GPG_KEY_SERVER variable was
removed in commit 316c28bcc2. Instead of
restoring the variable with an up-to-date keyserver (not part of the
SKS pool, as that network is dead), we remove the support entirely, as
it's a potential security issue, e.g. due to fingerprint collisions.
As a side effect this removes apt-key usage which has been deprecated
upstream and is slated for removal.
Change-Id: I63171a66201c631da9233d54579bd1601ff22e3e
Fresh deployments with SW-RAID (Software-RAID) might fail if the present
disks were already part of an SW-RAID setup:
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
We could also reproduce this inside PVE VMs:
| mdadm: /dev/md/127 has been started with 2 drives.
| Error: disk sda seems to be part of an existing SW-RAID setup.
This is caused by the following behavior:
| + SWRAID_DEVICE="/dev/md0"
| [...]
| + mdadm --assemble --scan
| + true
| + [[ -b /dev/md0 ]]
| + for disk in "${SWRAID_DISK1}" "${SWRAID_DISK2}"
| + grep -q nvme1n1 /proc/mdstat
| + die 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| + echo 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
By default we expect and set the SWRAID_DEVICE to be /dev/md0. But only
"local" arrays get assembled as /dev/md0 and upwards, whereas "foreign"
arrays start at md127 downwards. This is exactly what we get when
booting our deployment live system on top of an existing installation
and assembling existing SW-RAIDs (to avoid overwriting unexpected disks
by mistake):
| root@grml ~ # lsblk
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
| loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs
| /run/live/rootfs/ngcp.squashfs
| nvme0n1 259:0 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
| nvme1n1 259:7 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
|
| root@grml ~ # lsblk -l -n -o TYPE,NAME
| loop loop0
| raid1 md127
| disk nvme0n1
| disk nvme1n1
| part md127p1
| part md127p2
| part md127p3
| part md127p4
| part md127p5
| part md127p6
|
| root@grml ~ # cat /proc/cmdline
| vmlinuz initrd=initrd.img swraiddestroy swraiddisk2=nvme0n1 swraiddisk1=nvme1n1 [...]
Let's identify existing RAID devices and check their configuration by
going through the disks and comparing them with our SWRAID_DISK1 and
SWRAID_DISK2. If they don't match, we stop execution to prevent any
possible data damage.
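A simplified sketch of that check (using the lsblk invocation shown below; the actual implementation may differ):

  # identify active SW-RAID arrays and make sure their components match the
  # disks we are supposed to use, otherwise stop to prevent data damage
  for raiddev in $(lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}') ; do
    if ! mdadm --detail "/dev/${raiddev}" | grep -qE "${SWRAID_DISK1}|${SWRAID_DISK2}" ; then
      die "Error: existing RAID device /dev/${raiddev} doesn't match disks ${SWRAID_DISK1} + ${SWRAID_DISK2}"
    fi
  done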
Furthermore, we need to assemble the mdadm array without relying on a
possibly existing local `/etc/mdadm/mdadm.conf` configuration file.
Otherwise assembling might fail:
| root@grml ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| unused devices: <none>
| root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}'
| root@grml ~ # grep ARRAY /etc/mdadm/mdadm.conf
| ARRAY /dev/md/127 metadata=1.0 UUID=0d44774e:7269bac6:2f02f337:4551597b name=localhost:127
| root@grml ~ # mdadm --assemble --scan
| 2 root@grml ~ # mdadm --assemble --scan --verbose
| mdadm: looking for devices for /dev/md/127
| mdadm: No super block found on /dev/loop0 (Expected magic a92b4efc, got 800989c0)
| mdadm: no RAID superblock on /dev/loop0
| mdadm: No super block found on /dev/nvme1n1p3 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1p3
| mdadm: No super block found on /dev/nvme1n1p2 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1p2
| mdadm: No super block found on /dev/nvme1n1p1 (Expected magic a92b4efc, got 000080fe)
| mdadm: no RAID superblock on /dev/nvme1n1p1
| mdadm: No super block found on /dev/nvme1n1 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1
| mdadm: No super block found on /dev/nvme0n1p3 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1p3
| mdadm: No super block found on /dev/nvme0n1p2 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1p2
| mdadm: No super block found on /dev/nvme0n1p1 (Expected magic a92b4efc, got 000080fe)
| mdadm: no RAID superblock on /dev/nvme0n1p1
| mdadm: No super block found on /dev/nvme0n1 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1
| 2 root@grml ~ # mdadm --assemble --scan --config /dev/null
| mdadm: /dev/md/grml:127 has been started with 2 drives.
| root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}'
| md127
By running mdadm assemble with `--config /dev/null`, we prevent
consideration and usage of a possibly existing /etc/mdadm/mdadm.conf
configuration file.
Example output of running the new code:
| [...]
| mdadm: No arrays found in config file or automatically
| NOTE: default SWRAID_DEVICE set to /dev/md0 though we identified active md127
| NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup
| Wiping signatures from /dev/md127
| /dev/md127: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
| Removing mdadm device /dev/md127
| Stopping mdadm device /dev/md127
| mdadm: stopped /dev/md127
| Zero-ing superblock from /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme1n1
| Zero-ing superblock from /dev/nvme0n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| NOTE: modified RAID array detected, setting SWRAID_DEVICE back to original setting '/dev/md0'
| Removing possibly existing LVM/PV label from /dev/nvme1n1
| Cannot use /dev/nvme1n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme1n1p1
| Cannot use /dev/nvme1n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme1n1p2
| Labels on physical volume "/dev/nvme1n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3
| Cannot use /dev/nvme1n1p3: device is an md component
| Wiping disk signatures from /dev/nvme1n1
| /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme1n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0027866 s, 376 MB/s
| Removing possibly existing LVM/PV label from /dev/nvme0n1
| Cannot use /dev/nvme0n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme0n1p1
| Cannot use /dev/nvme0n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme0n1p2
| Labels on physical volume "/dev/nvme0n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme0n1p3
| Cannot use /dev/nvme0n1p3: device is an md component
| Wiping disk signatures from /dev/nvme0n1
| /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme0n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00278955 s, 376 MB/s
| Creating partition table
| Get path of EFI partition
| pvdevice is now available: /dev/nvme1n1p2
| The operation has completed successfully.
| The operation has completed successfully.
| pvdevice is now available: /dev/nvme1n1p3
| pvdevice is now available: /dev/nvme0n1p3
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024
| mdadm: Note: this array has metadata at the start and
| may not be suitable as a boot device. If you plan to
| store '/boot' on this device please ensure that
| your boot-loader understands md/v1.x metadata, or use
| --metadata=0.90
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Physical volume "/dev/md0" successfully created.
| Volume group "ngcp" successfully created
| 0 logical volume(s) in volume group "ngcp" now active
| Creating LV 'root' with 10G
| [...]
|
| mdadm: stopped /dev/md127
| mdadm: No arrays found in config file or automatically
| NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup
| Removing mdadm device /dev/md127
| Stopping mdadm device /dev/md127
| mdadm: stopped /dev/md127
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| mdadm: Note: this array has metadata at the start and
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| lvm2 mdadm wget
| Get:1 http://http-proxy.lab.sipwise.com/debian bookworm/main amd64 mdadm amd64 4.2-5 [443 kB]
| Selecting previously unselected package mdadm.
| Preparing to unpack .../0-mdadm_4.2-5_amd64.deb ...
| Unpacking mdadm (4.2-5) ...
| Setting up mdadm (4.2-5) ...
| [...]
| mdadm: stopped /dev/md0
Change-Id: Ib5875248e9c01dd4251bfab2cc4c94daace503fa
Deploying current NGCP trunk on an NVMe-powered SW-RAID setup failed with:
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Cannot use /dev/md0: device is partitioned
This is caused by /dev/md0 still containing partition data, while its
component nvme1n1p3 also still carries a linux_raid_member disk
signature.
So it's *not* enough to stop the mdadm array, remove PV/LVM information
from the partitions and finally wipe SW-RAID disks /dev/nvme1n1 +
/dev/nvme0n1 (example output from such a failing run):
| mdadm: /dev/md/0 has been started with 2 drives.
| mdadm: stopped /dev/md0
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| Removing possibly existing LVM/PV label from /dev/nvme1n1
| Cannot use /dev/nvme1n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme1n1p1
| Cannot use /dev/nvme1n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme1n1p2
| Labels on physical volume "/dev/nvme1n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3
| Cannot use /dev/nvme1n1p3: device is an md component
| Wiping disk signatures from /dev/nvme1n1
| /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme1n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00314195 s, 334 MB/s
| Removing possibly existing LVM/PV label from /dev/nvme0n1
| Cannot use /dev/nvme0n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme0n1p1
| Cannot use /dev/nvme0n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme0n1p2
| Labels on physical volume "/dev/nvme0n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme0n1p3
| Cannot use /dev/nvme0n1p3: device is an md component
| Wiping disk signatures from /dev/nvme0n1
| /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme0n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00893285 s, 117 MB/s
| Creating partition table
| Get path of EFI partition
| pvdevice is now available: /dev/nvme1n1p2
| The operation has completed successfully.
| The operation has completed successfully.
| pvdevice is now available: /dev/nvme1n1p3
| pvdevice is now available: /dev/nvme0n1p3
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023
| mdadm: Note: this array has metadata at the start and
| may not be suitable as a boot device. If you plan to
| store '/boot' on this device please ensure that
| your boot-loader understands md/v1.x metadata, or use
| --metadata=0.90
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Cannot use /dev/md0: device is partitioned
Instead, we also need to wipe the signatures from the SW-RAID device
(like /dev/md0), only then stop it, then wipe the disk signatures from
all the partitions (like /dev/nvme1n1p3), and only then finally remove
the disk signatures from the main block device (like /dev/nvme1n1).
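A rough sketch of the new ordering (device names as in the example above; the actual code loops over the configured SW-RAID disks and their partitions):

  # 1) wipe signatures from the assembled SW-RAID device itself
  wipefs --all /dev/md0
  # 2) only then stop the array
  mdadm --stop /dev/md0
  # 3) wipe signatures from all partitions of each disk ...
  wipefs --all /dev/nvme1n1p1 /dev/nvme1n1p2 /dev/nvme1n1p3
  # 4) ... and only then from the main block device itself
  wipefs --all /dev/nvme1n1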
Example from a successful run with this change:
| root@grml ~ # grep -e mdadm -e Wiping /tmp/deployment-installer-debug.log
| mdadm: /dev/md/0 has been started with 2 drives.
| Wiping signatures from /dev/md0
| Removing mdadm device /dev/md0
| Stopping mdadm device /dev/md0
| mdadm: stopped /dev/md0
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| Wiping disk signatures from partition /dev/nvme1n1p1
| Wiping disk signatures from partition /dev/nvme1n1p2
| Wiping disk signatures from partition /dev/nvme1n1p3
| Wiping disk signatures from /dev/nvme1n1
| Wiping disk signatures from partition /dev/nvme0n1p1
| Wiping disk signatures from partition /dev/nvme0n1p2
| Wiping disk signatures from partition /dev/nvme0n1p3
| Wiping disk signatures from /dev/nvme0n1
| mdadm: Note: this array has metadata at the start and
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Wiping ext3 signature on /dev/ngcp/root.
| Wiping ext4 signature on /dev/ngcp/fallback.
| Wiping ext4 signature on /dev/ngcp/data.
While at it, be more verbose about the executed steps.
FTR, disk and setup information of such a system where we noticed the
failure and worked on this change:
| root@grml ~ # fdisk -l
| Disk /dev/nvme0n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors
| Disk model: DELL NVME ISE PE8010 RI M.2 480GB
| Units: sectors of 1 * 512 = 512 bytes
| Sector size (logical/physical): 512 bytes / 512 bytes
| I/O size (minimum/optimal): 512 bytes / 512 bytes
| Disklabel type: gpt
| Disk identifier: 5D296676-52CF-49CF-863A-6D3A3BD0604F
|
| Device Start End Sectors Size Type
| /dev/nvme0n1p1 2048 4095 2048 1M BIOS boot
| /dev/nvme0n1p2 4096 999423 995328 486M EFI System
| /dev/nvme0n1p3 999424 937701375 936701952 446.7G Linux RAID
|
|
| Disk /dev/nvme1n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors
| Disk model: DELL NVME ISE PE8010 RI M.2 480GB
| Units: sectors of 1 * 512 = 512 bytes
| Sector size (logical/physical): 512 bytes / 512 bytes
| I/O size (minimum/optimal): 512 bytes / 512 bytes
| Disklabel type: gpt
| Disk identifier: 9AFA8ACF-D2CD-4224-BA0C-D38A6581D0F9
|
| Device Start End Sectors Size Type
| /dev/nvme1n1p1 2048 4095 2048 1M BIOS boot
| /dev/nvme1n1p2 4096 999423 995328 486M EFI System
| /dev/nvme1n1p3 999424 937701375 936701952 446.7G Linux RAID
| [...]
|
| root@grml ~ # lsblk
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
| loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs
| /run/live/rootfs/ngcp.squashfs
| nvme0n1 259:0 0 447.1G 0 disk
| ├─nvme0n1p1 259:5 0 1M 0 part
| ├─nvme0n1p2 259:8 0 486M 0 part
| └─nvme0n1p3 259:9 0 446.7G 0 part
| └─md0 9:0 0 446.5G 0 raid1
| ├─ngcp-root 253:0 0 10G 0 lvm /mnt
| ├─ngcp-fallback 253:1 0 10G 0 lvm
| └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data
| nvme1n1 259:4 0 447.1G 0 disk
| ├─nvme1n1p1 259:2 0 1M 0 part
| ├─nvme1n1p2 259:6 0 486M 0 part
| └─nvme1n1p3 259:7 0 446.7G 0 part
| └─md0 9:0 0 446.5G 0 raid1
| ├─ngcp-root 253:0 0 10G 0 lvm /mnt
| ├─ngcp-fallback 253:1 0 10G 0 lvm
| └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data
|
| root@grml ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| md0 : active raid1 nvme0n1p3[1] nvme1n1p3[0]
| 468218880 blocks super 1.2 [2/2] [UU]
| [==>..................] resync = 12.7% (59516864/468218880) finish=33.1min speed=205685K/sec
| bitmap: 4/4 pages [16KB], 65536KB chunk
|
| unused devices: <none>
Change-Id: Iaa7f49eef11ef6ad6209fe962bb8940a75a87c95
We get the following error message in /var/log/vboxadd-install.log,
/var/log/deployment-installer-debug.log, /var/log/daemon.log +
/var/log/syslog:
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
This is caused by missing libxmu6:
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
| [sipwise-lab-trunk] sipwise@spce:~$ sudo apt install libxmu6
| Reading package lists... Done
| Building dependency tree... Done
| Reading state information... Done
| The following NEW packages will be installed:
| libxmu6
| 0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
| Need to get 60.1 kB of archives.
| After this operation, 143 kB of additional disk space will be used.
| Get:1 https://debian.sipwise.com/debian bookworm/main amd64 libxmu6 amd64 2:1.1.3-3 [60.1 kB]
| Fetched 60.1 kB in 0s (199 kB/s)
| [...]
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| Oracle VM VirtualBox VBoxClient 7.0.6
| Copyright (C) 2005-2023 Oracle and/or its affiliates
|
| Usage: VBoxClient --clipboard|--draganddrop|--checkhostversion|--seamless|--vmsvga|--vmsvga-session
| [-d|--nodaemon]
|
| Options:
| [...]
It looks like lack of libxmu6 doesn't cause any actual problems for our
use case (we don't use X.org at all), though given that libxmu6 is a
small library package, let's try to get it working as expected and avoid
the alarming errors in the logs.
Thanks Guillem Jover for spotting and reporting
Change-Id: I65f3dd496a4026f04fd9944fd7cc43d6abbdf336
During initial deployment of a system, we get warnings about
the lack of zstd:
| Setting up linux-image-6.1.0-13-amd64 (6.1.55-1) ...
| I: /vmlinuz.old is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img.old is now a symlink to boot/initrd.img-6.1.0-13-amd64
| I: /vmlinuz is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img is now a symlink to boot/initrd.img-6.1.0-13-amd64
| /etc/kernel/postinst.d/initramfs-tools:
| update-initramfs: Generating /boot/initrd.img-6.1.0-13-amd64
| W: No zstd in /usr/bin:/sbin:/bin, using gzip
| [...]
The initramfs generation and update runs *four* times overall within
the initial bootstrapping of a system (we'll try to do something about
this, but it's outside the scope of this change).
As of v0.141, initramfs-tools uses zstd as the default compression for
the initramfs. Version 0.142 is shipped with Debian/bookworm, and
therefore it makes sense to have zstd available upfront. Note that the
initrd generation is also faster with zstd (~10s for zstd vs. ~13s for
gzip) and the resulting initrd is smaller (~33MB for zstd vs. ~39MB for
gzip).
By making sure that zstd is available right from the very beginning,
before ngcp-installer pulls it in later, we not only avoid the warning
message but also save >10 seconds of install time.
Given that zstd is available even in Debian oldoldstable, let's install
it unconditionally in all our systems.
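A minimal sketch of the idea, with an illustrative package list variable (the actual variable/function in the deployment code may differ):

  # zstd is tiny, available since Debian oldoldstable and the default
  # initramfs compressor as of initramfs-tools >= 0.141, so always pull it in
  DEBIAN_PACKAGES="${DEBIAN_PACKAGES} zstd"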
Thanks: Volodymyr Fedorov for reporting
Change-Id: I56674c3c213f7c7a6e6cbce3c8e2e00a4cfbdbd4
Even though ntpsec.service contains an Alias for ntp.service, that
alias does not work for us while the service has not yet been installed,
so the first run fails. Use the actual service name to avoid this issue.
Change-Id: I8f0ee3b38390a7e58c3bbee65fd96bfd4b717dfa
It's better to have this package in the grml-sipwise image, so any
system with this network card can use its full capabilities already
during the deployment stage.
Change-Id: I765efcf446a410a42ef156b2ccc2e6612a33ddd6
Let's restore system state of /run/systemd/system for
VBoxLinuxAdditions, to avoid any unexpected side effects.
Followup for git rev 8601193
Change-Id: I632c7d60ebb627c3a80d4c1f9b264d6d0a13b4f1
Recent Grml ISOs, including our Grml-Sipwise ISO (v2023-06-01), include
grml-autoconfig v0.20.3, which executes the grml-autoconfig service with
`StandardInput=null`. This is necessary to avoid conflicting with tty
usage, e.g. with a serial console. See git rev
1e268ffe4f
Now that we run with /dev/null for stdin, we can't interact with the
user, so let's try to detect when running from within grml-autoconfig's
systemd unit, and if so assume that we're executing on /dev/tty1 and
use/reopen that for stdin.
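A hedged sketch of that logic (the exact way the script detects the grml-autoconfig unit may differ):

  # if stdin is /dev/null we're most likely running from grml-autoconfig's
  # systemd unit; assume the user is on /dev/tty1 and reopen it for stdin
  if [ "$(readlink -f /proc/$$/fd/0 2>/dev/null)" = "/dev/null" ] ; then
    exec < /dev/tty1
  fi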
Change-Id: Id55283c7f862487a6ef8acb8ab01f67a05bd8dd7
As of git rev 6c960afee4 we're using the
virtualbox-guest-additions-iso from bookworm.
Previous versions of VBoxGuestAdditions had a simple test to check for
the presence of systemd, quoting from
/opt/VBoxGuestAdditions-6.1.22/routines.sh:
| use_systemd()
| {
| test ! -f /sbin/init || test -L /sbin/init
| }
Now in more recent versions of VBoxGuestAdditions[1], the systemd check
was modified, quoting from /opt/VBoxGuestAdditions-7.0.6/routines.sh:
| use_systemd()
| {
| # First condition is what halfway recent systemd uses itself, and the
| # other two checks should cover everything back to v1.
| test -e /run/systemd/system || test -e /sys/fs/cgroup/systemd || test -e /cgroup/systemd
| }
So if we're running inside a chroot, as with our deployment.sh, the
system looks like a non-systemd system to the VBoxGuestAdditions
installer, and we end up with the installation and presence of
/etc/init.d/vboxadd, leading to:
| root@spce:~# ls -lah /run/systemd/generator.late/
| total 4.0K
| drwxr-xr-x 4 root root 100 Jul 18 00:20 .
| drwxr-xr-x 23 root root 580 Jul 18 00:20 ..
| drwxr-xr-x 2 root root 60 Jul 18 00:20 graphical.target.wants
| drwxr-xr-x 2 root root 60 Jul 18 00:20 multi-user.target.wants
| -rw-r--r-- 1 root root 537 Jul 18 00:20 vboxadd.service
|
| root@spce:~# systemctl cat vboxadd.service
| # /run/systemd/generator.late/vboxadd.service
| # Automatically generated by systemd-sysv-generator
|
| [Unit]
| Documentation=man:systemd-sysv-generator(8)
| SourcePath=/etc/init.d/vboxadd
| Description=LSB: VirtualBox Linux Additions kernel modules
| Before=multi-user.target
| Before=multi-user.target
| Before=multi-user.target
| Before=graphical.target
| Before=display-manager.service
|
| [Service]
| Type=forking
| Restart=no
| TimeoutSec=5min
| IgnoreSIGPIPE=no
| KillMode=process
| GuessMainPID=no
| RemainAfterExit=yes
| SuccessExitStatus=5 6
| ExecStart=/etc/init.d/vboxadd start
| ExecStop=/etc/init.d/vboxadd stop
We don't expect any init scripts to be present, as all our services
must have systemd unit files. Therefore we check for the absence of
systemd's /run/systemd/generator.late in our system-tests, which started
to fail with the upgrade to VBoxGuestAdditions v7.0.6 due to the systemd
presence detection mentioned above.
Let's fake the presence of systemd before invoking the
VBoxGuestAdditions installer, to avoid ending up with unexpected vbox*
init scripts.
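A sketch of that workaround (the actual installer invocation is left out; restoring the original state corresponds to the earlier entry referencing git rev 8601193):

  # make the chroot look like a systemd system for the use_systemd() check
  fake_systemd=false
  if ! [ -e /run/systemd/system ] ; then
    mkdir -p /run/systemd/system
    fake_systemd=true
  fi
  # ... invoke the VBoxGuestAdditions installer here ...
  if ${fake_systemd} ; then
    rmdir /run/systemd/system
  fi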
[1] See svn rev 92682:
https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Installer/linux/routines.sh?rev=92682
https://www.virtualbox.org/changeset?old=92681&old_path=vbox%2Ftrunk%2Fsrc%2FVBox%2FInstaller%2Flinux%2Froutines.sh&new=92682&new_path=vbox%2Ftrunk%2Fsrc%2FVBox%2FInstaller%2Flinux%2Froutines.sh
Change-Id: Ifd11460e3a8fd4f4c1269453a9b8376065861b8e
Support the bookworm option in the DEBIAN_RELEASE selection; we already
have support for it. Use bookworm as the fallback, since we have moved
to it by now.
Change-Id: I118c1b5cf81fe57394495b5f745fc81032406c78
To be able to upgrade our internal systems to Debian/bookworm
we need to have puppet packages available.
Upstream still doesn't provide any Debian packages
(see https://tickets.puppetlabs.com/browse/PA-4995),
though their AIO (All In One) packages for Debian/bullseye
seem to work on Debian/bookworm as well (at least for
puppet-agent). So until we either migrate to the puppet-agent
as present in Debian/bookworm or upstream provides corresponding
AIO packages, let's use the puppet-agent packages we already
use for our Debian/bullseye systems.
Change-Id: I2211ffd79f70a2a79873e737b0b512bfb7492328
Since version 1.20.0, dpkg no longer creates /var/lib/dpkg/available
(see #647911). Now that we upgraded our Grml-Sipwise deployment system
to bookworm, we have dpkg v1.21.22 on our live system, and mmdebstrap
relies on dpkg of the host system for execution.
But on Debian releases up to and including buster, dpkg fails to
operate with e.g. `dpkg --set-selections` if /var/lib/dpkg/available
doesn't exist:
| The following NEW packages will be installed:
| nullmailer
| [...]
| debconf: delaying package configuration, since apt-utils is not installed
| dpkg: error: failed to open package info file '/var/lib/dpkg/available' for reading: No such file or directory
We *could* switch from mmdebstrap to debootstrap for deploying Debian
releases <= buster, but this would be slower, and we have been using
mmdebstrap for everything for quite some time. So instead let's create
/var/lib/dpkg/available after bootstrapping the system.
Reported towards mmdebstrap as #1037946.
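A minimal sketch, assuming the bootstrapped system is mounted at "${TARGET}" (the variable name is illustrative):

  # dpkg >= 1.20.0 no longer creates this file, but dpkg in releases up to
  # buster needs it for e.g. `dpkg --set-selections`
  if ! [ -f "${TARGET}/var/lib/dpkg/available" ] ; then
    touch "${TARGET}/var/lib/dpkg/available"
  fi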
Change-Id: I0a87ca255d5eb7144a9c093051c0a6a3114a3c0b
Our deployment system is now based on Debian/bookworm, while our
gerrit/git server still runs on Debian/bullseye, so we run into the
OpenSSH RSA issue (RSA signatures using the SHA-1 hash algorithm are
disabled by default), see
https://michael-prokop.at/blog/2023/06/11/what-to-expect-from-debian-bookworm-newinbookworm/
and https://www.jhanley.com/blog/ssh-signature-algorithm-ssh-rsa-error/
We need to enable ssh-rsa usage, otherwise deployment fails with:
| Warning: Permanently added '[gerrit.mgm.sipwise.com]:29418' (ED25519) to the list of known hosts.
| sign_and_send_pubkey: no mutual signature supported
| puppet-r10k@gerrit.mgm.sipwise.com: Permission denied (publickey).
| fatal: Could not read from remote repository.
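A sketch of the kind of ssh configuration that re-enables this, restricted to the gerrit host (the file location used by the deployment script may differ):

  # re-allow RSA/SHA-1 signatures solely for the still bullseye-based gerrit
  printf 'Host gerrit.mgm.sipwise.com\n    PubkeyAcceptedAlgorithms +ssh-rsa\n' >> "${HOME}/.ssh/config"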
Change-Id: I5894170dab033d52a2612beea7b6f27ab06cc586
Deploying the Debian/bookworm based NGCP system fails on a Lenovo sr250
v2 node with an Intel E810 network card:
| # lshw -c net -businfo
| Bus info Device Class Description
| =======================================================
| pci@0000:01:00.0 eth0 network Ethernet Controller E810-XXV for SFP
| pci@0000:01:00.1 eth1 network Ethernet Controller E810-XXV for SFP
| # lshw -c net
| *-network:0
| description: Ethernet interface
| product: Ethernet Controller E810-XXV for SFP
| vendor: Intel Corporation
| physical id: 0
| bus info: pci@0000:01:00.0
| logical name: eth0
| version: 02
| serial: [...]
| size: 10Gbit/s
| capacity: 25Gbit/s
| width: 64 bits
| clock: 33MHz
| capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 1000bt-fd 25000bt-fd
| configuration: autonegotiation=off broadcast=yes driver=ice driverversion=1.11.14 duplex=full firmware=2.25 0x80007027 1.2934.0 ip=192.168.90.51 latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
| resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:4002000000-4003ffffff memory:4006010000-400601ffff memory:a1d00000-a1dfffff memory:4005000000-4005ffffff memory:4006220000-400641ffff
We set up the /etc/network/interfaces file by invoking Grml's
netcardconfig script in automated mode, like:
NET_DEV=eth0 METHOD=static IPADDR=192.168.90.51 NETMASK=255.255.255.248 GATEWAY=192.168.90.49 /usr/sbin/netcardconfig
The resulting /etc/network/interfaces is used as the base for the NGCP
chroot/target system. netcardconfig shuts down the network
interface (eth0 in the example above) via ifdown, then sleeps for 3
seconds and re-enables the interface (via ifup) with the new
configuration.
This used to work fine, but with the Intel E810 network card and
kernel version 6.1.0-9-amd64 from Debian/bookworm we see a link failure,
and it takes ~10 seconds until the network device is up and running
again. The following vagrant_configuration() execution from
deployment.sh then fails:
| +11:41:01 (netscript.grml:1022): vagrant_configuration(): wget -O /var/tmp/id_rsa_sipwise.pub http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| --2023-06-11 11:41:01-- http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| Resolving builder.mgm.sipwise.com (builder.mgm.sipwise.com)... failed: Name or service not known.
| wget: unable to resolve host address 'builder.mgm.sipwise.com'
However, when we retry just a bit later, the network works fine
again. During investigation we identified that the network card flips
the port; quoting the related log from the connected Cisco Nexus 5020
switch (with fast STP learning mode):
| nexus5k %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/33 is down (Link failure)
It seems to be related to some autonegotiation problem, as when we
execute `ethtool -A eth0 rx on tx on` (no matter whether with `on` or
`off`), we see:
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: Autoneg did not complete so changing settings may not result in an actual change.
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: NIC Link is Down
| [Tue Jun 13 08:51:45 2023] ice 0000:01:00.0 eth0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: Rx/Tx
FTR:
| root@sp1 ~ # ethtool -A eth0 autoneg off
| netlink error: Operation not supported
| 76 root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # ethtool -A eth0 autoneg on
| root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # dmesg -T | tail -1
| [Tue Jun 13 08:53:26 2023] ice 0000:01:00.0 eth0: To change autoneg please use: ethtool -s <dev> autoneg <on|off>
| root@sp1 ~ # ethtool -s eth0 autoneg off
| root@sp1 ~ # ethtool -s eth0 autoneg on
| netlink error: link settings update failed
| netlink error: Operation not supported
| 75 root@sp1 ~ #
As a workaround, at least until we have a better fix/solution, we try
to reach the default gateway (or fall back to the repository host if the
gateway couldn't be identified) via ICMP/ping, and once that works we
continue as usual. Even if that fails, we continue execution anyway, to
minimize the behavior change while still having a workaround for this
specific situation.
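A rough sketch of that workaround (gateway detection via `ip route`; the fallback variable name is illustrative):

  # wait until the default gateway (or the repository host as fallback)
  # answers ICMP, but don't fail hard if it never does
  target=$(ip -4 route show default | awk '{print $3; exit}')
  [ -n "${target}" ] || target="${SIPWISE_REPO_HOST}"
  for _ in $(seq 1 30) ; do
    ping -c 1 -W 1 "${target}" >/dev/null 2>&1 && break
    sleep 1
  done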
FTR, broken system:
| root@sp1 ~ # ethtool -i eth0
| driver: ice
| version: 6.1.0-9-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Whereas with kernel 5.10.0-23-amd64 from Debian/bullseye we don't seem
to see that behavior:
| root@sp1:~# ethtool -i neth0
| driver: ice
| version: 5.10.0-23-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Also using latest available ice v1.11.14 (from
https://sourceforge.net/projects/e1000/files/ice%20stable/1.11.14/)
on Kernel version 6.1.0-9-amd64 doesn't bring any change:
| root@sp1 ~ # modinfo ice
| filename: /lib/modules/6.1.0-9-amd64/updates/drivers/net/ethernet/intel/ice/ice.ko
| firmware: intel/ice/ddp/ice.pkg
| version: 1.11.14
| license: GPL v2
| description: Intel(R) Ethernet Connection E800 Series Linux Driver
| author: Intel Corporation, <linux.nics@intel.com>
| srcversion: 818E9C817731C98A25470C0
| alias: pci:v00008086d00001888sv*sd*bc*sc*i*
| [...]
| alias: pci:v00008086d00001591sv*sd*bc*sc*i*
| depends: ptp
| retpoline: Y
| name: ice
| vermagic: 6.1.0-9-amd64 SMP preempt mod_unload modversions
| parm: debug:netif level (0=none,...,16=all) (int)
| parm: fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
| (ushort)
| parm: fwlog_events:FW events to log (32-bit mask)
| (ulong)
| root@sp1 ~ # ethtool -i eth0 | head -3
| driver: ice
| version: 1.11.14
| firmware-version: 2.25 0x80007027 1.2934.0
| root@sp1 ~ #
Change-Id: Ieafe648be4e06ed0d936611ebaf8ee54266b6f3c
Re-reading of disks fails if the mdadm SW-RAID device is still active:
| root@sp1 ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| md0 : active raid1 sdb3[1] sda3[0]
| 468218880 blocks super 1.2 [2/2] [UU]
| [========>............] resync = 42.2% (197855168/468218880) finish=22.4min speed=200756K/sec
| bitmap: 3/4 pages [12KB], 65536KB chunk
|
| unused devices: <none>
| root@sp1 ~ # blockdev --rereadpt /dev/sdb
| blockdev: ioctl error on BLKRRPART: Device or resource busy
| 1 root@sp1 ~ # blockdev --rereadpt /dev/sda
| blockdev: ioctl error on BLKRRPART: Device or resource busy
| 1 root@sp1 ~ #
Only once we stop the mdadm SW-RAID device can we re-read the
partition table:
| root@sp1 ~ # mdadm --stop /dev/md0
| mdadm: stopped /dev/md0
| root@sp1 ~ # blockdev --rereadpt /dev/sda
| root@sp1 ~ #
This behavior isn't new and is unrelated to Debian/bookworm, but it was
spotted while debugging an unrelated issue.
FTR: we re-read the partition table (via `blockdev --rereadpt`) to
ensure that /etc/fstab of the live system is up to date and matches the
current system state. While this isn't strictly needed, we preserve the
existing behavior and also try to avoid a hard "cut" of a possibly
ongoing SW-RAID sync.
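A condensed sketch of that order (the actual code operates on the configured SW-RAID disks):

  # stop any active SW-RAID arrays first, ...
  for md in $(awk '/^md/ {print $1}' /proc/mdstat) ; do
    mdadm --stop "/dev/${md}"
  done
  # ... only then re-reading the partition tables succeeds
  blockdev --rereadpt "/dev/${SWRAID_DISK1}" "/dev/${SWRAID_DISK2}"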
Change-Id: I735b00423e6efa932f74b78a38ed023576e5d306
With our newer Grml-Sipwise ISO (v2023-06-01) being based on
Debian/bookworm and recent Grml packages, our automated deployment
suddenly started to fail for us:
| +04:28:12 (netscript.grml:2453): echo 'Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds]'
| ++04:28:12 (netscript.grml:2455): get_deploy_status
| ++04:28:12 (netscript.grml:95): get_deploy_status(): '[' -r /srv/deployment//status ']'
| ++04:28:12 (netscript.grml:96): get_deploy_status(): cat /srv/deployment//status
| Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds]
| +04:28:12 (netscript.grml:2455): '[' copylogfiles '!=' error ']'
| +04:28:12 (netscript.grml:2456): set_deploy_status finished
| +04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n finished ']'
| +04:28:12 (netscript.grml:104): set_deploy_status(): echo finished
| +04:28:12 (netscript.grml:2459): false
| +04:28:12 (netscript.grml:2463): status_wait
| +04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]]
| +04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]]
| +04:28:12 (netscript.grml:2466): false
| +04:28:12 (netscript.grml:2471): false
| +04:28:12 (netscript.grml:2476): echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)'
| Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)
| +04:28:12 (netscript.grml:2477): unset a
| +04:28:12 (netscript.grml:2478): read -r a
| ++04:28:12 (netscript.grml:2478): wait_exit
| ++04:28:12 (netscript.grml:339): wait_exit(): local e_code=1
| ++04:28:12 (netscript.grml:340): wait_exit(): [[ 1 -ne 0 ]]
| ++04:28:12 (netscript.grml:341): wait_exit(): set_deploy_status error
| ++04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n error ']'
| ++04:28:12 (netscript.grml:104): set_deploy_status(): echo error
| ++04:28:12 (netscript.grml:343): wait_exit(): trap '' 1 2 3 6 15 ERR EXIT
| ++04:28:12 (netscript.grml:344): wait_exit(): status_wait
| ++04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]]
| ++04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]]
| ++04:28:12 (netscript.grml:345): wait_exit(): exit 1
As of grml-autoconfig v0.20.3 and newer, the grml-autoconfig systemd service
that invokes the deployment netscript uses `StandardInput=null` instead of
`StandardInput=tty` (see https://github.com/grml/grml/issues/176).
Due to this, a logic error in our deployment script surfaced: we exit
the script in interactive mode, but only *afterwards* prompt for
reboot/halt with `read -r a` - which of course fails if stdin is
missing. As a result, we end up in our signal handler `trap 'wait_exit;'
1 2 3 6 15 ERR EXIT` and then fail the deployment.
So instead prompt for "Do you want to [r]eboot or [h]alt ..." *only* in
interactive mode, and while at it drop the "if "$INTERACTIVE" ; then
exit 0 ; fi" so the prompt is actually presented to the user.
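A condensed sketch of the resulting logic (prompt text and variable as quoted above; the actual handling of the answer stays as before):

  # prompt only in interactive mode; with StandardInput=null there is no
  # stdin to read the answer from anyway
  if "${INTERACTIVE}" ; then
    echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)'
    unset a
    read -r a
    # ... handle "$a" (reboot/halt/cancel) as before ...
  fi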
Change-Id: Ia89beaf3c446f3701cc30ab21cfdff7b5808a6d3
Manual execution of python's http.server has multiple drawbacks, like no
proper logging and no service tracking/restart options, but most notably
the deployment status server no longer runs when our deployment script
fails.
While /srv/deployment/status then still might contain "error", no one is
serving that information on port 4242 any longer[1], and our
daily-build-install-vm Jenkins job might then report:
| VM '192.168.209.162' current state is '' - retrying up to another 1646 times, sleeping for a second
| VM '192.168.209.162' current state is '' - retrying up to another 1645 times, sleeping for a second
| [...]
It then runs for about half an hour without doing anything useful,
until the Jenkins job itself gives up.
By running our deployment status server under systemd, we keep the
service alive also when the deployment script terminates. In case of
errors we get immediate feedback:
| VM '192.168.209.162' current state is 'puppet' - retrying up to another 1648 times, sleeping for a second
| VM '192.168.209.162' current state is 'puppet' - retrying up to another 1647 times, sleeping for a second
| VM '192.168.209.162' current state is 'error' - retrying up to another 1646 times, sleeping for a second
| + '[' error '!=' finished ']'
| + echo 'Failed to install Proxom VM '\''162'\'' (IP '\''192.168.209.162'\'')'
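A sketch of what such a unit could look like (unit name and exact options are illustrative; the actual unit shipped by the script may differ):

  # /run/systemd/system/deployment-status-server.service (illustrative path)
  [Unit]
  Description=Deployment status server

  [Service]
  WorkingDirectory=/srv/deployment
  ExecStart=/usr/bin/python3 -m http.server 4242
  Restart=on-failure

  [Install]
  WantedBy=multi-user.target

The script would then just run `systemctl daemon-reload` followed by `systemctl start deployment-status-server.service`.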
[1] For our NGCP based installations we use the ngcpstatus boot option,
where its status_wait trap kicks in and avoids premature exit of the
deployment status server. But e.g. our non-NGCP systems don't use that
boot option, and with this change we could get rid of status_wait
altogether.
Change-Id: Ibaa799358caedf31c64c37b48e3c5e889808086a
Packages like 'firmware-linux', 'firmware-linux-nonfree',
'firmware-misc-nonfree' and further 'firmware-*' got moved from non-free
to the new non-free-firmware component/repository (related to
https://www.debian.org/vote/2022/vote_003).
grml-live v0.43.0 provides support for this new component, so let's
make sure we have proper support for firmware-related packages by
updating to the corresponding grml-live version.
Change-Id: I4704e8be051ab6b5496021f07f42208b34963739
Use system-tools' ngcp-initialize-udev-rules-net script to
deploy /etc/udev/rules.d/70-persistent-net.rules, so there is no need
to maintain the code in multiple places.
Change-Id: I81925262a8c687aa9976cbc1113568989fa53281
When building our Debian boxes for buster, bullseye + bookworm (via the
daily-build-matrix-debian-boxes Jenkins job), we get broken networking,
so e.g. `vagrant up debian-bookworm` doesn't work.
This is caused by /etc/network/interfaces (using e.g. "neth0", the
naming scheme we use in NGCP, as adjusted by the deployment script) not
matching the actual system network devices (like enp0s3).
TL;DR: no behavior change for NGCP systems; only when building non-NGCP
systems do we enable net.ifnames=0 (via set_custom_grub_boot_options),
but we do *not* generate /etc/udev/rules.d/70-persistent-net.rules (via
generate_udev_network_rules), nor rename eth* -> neth* in
/etc/network/interfaces.
More verbose version:
* rename the "eth*" networking interfaces into "neth*" in
/etc/network/interfaces only when running in ngcp-installer mode
(this is the behavior we rely on in NGCP, but it doesn't matter
for plain Debian systems)
* generate /etc/udev/rules.d/70-persistent-net.rules only when running
in ngcp-installer mode. While our jenkins-configs.git's
jobs/daily-build/scripts/vm_clean-fs.sh removes the file anyway (for
the VM use case), between the initial deployment run and the next reboot
the configuration inside the PVE VM still applies, so we end up with
an existing /etc/udev/rules.d/70-persistent-net.rules, referring to
neth0, while our /etc/network/interfaces configures eth0 instead.
* when *not* running in ngcp-installer mode, enable net.ifnames=0 usage
in GRUB to disable persistent network interface naming. FTR, this
change is *not* needed for NGCP, as on NGCP systems we use
/etc/udev/rules.d/70-persistent-net.rules, generated by
ngcp-system-tools' ngcp-initialize-udev-rules-net script also in the VM
use case
This is a fixup for a change in git commit a50903a30c (see also commit
message of git commit ab62171), which should have been adjusted for
ngcp-installer-only mode instead.
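A condensed sketch of the resulting conditional (NGCP_INSTALLER is an illustrative flag name for ngcp-installer mode; function names as referenced above):

  if "${NGCP_INSTALLER}" ; then
    # NGCP: keep existing behavior, i.e. rename eth* -> neth* in
    # /etc/network/interfaces and generate the persistent-net rules
    generate_udev_network_rules
  else
    # plain Debian: stick to the traditional kernel interface names instead
    set_custom_grub_boot_options   # enables net.ifnames=0
  fi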
Change-Id: I6d0021dbdc2c1587127f0e115c6ff9844460a761
The public name servers resolve deb.sipwise.com to our public OVH IP
address 164.132.119.186, while internally we want to use its cname
haproxy.mgm.sipwise.com. This only works when using our internal
nameservers (like 192.168.212.30 and 192.168.88.20).
Default to 192.168.212.30, so deployments work as expected, otherwise
we're failing during deployment with:
| Err:5 https://deb.sipwise.com/autobuild release-trunk-bookworm InRelease
| 403 Forbidden [IP: 164.132.119.186 443]
While at it, also update the ip=... kernel option to use
168.192.91.XX/24 by default, and use an FQDN for the hostname (since
that's our current policy for puppet hostnames/certificates).
Change-Id: I1ce6541f7a31baa437e679b67056bb7851b1b33d
Relevant changes:
* GRMLBASE/39-modprobe: avoid usage of /lib/modprobe.d/50-nfs.conf
* GRMLBASE/39-modprobe: do not expect all files in /etc/modprobe.d to be used
This gives us working netboot images and avoids sysctl errors during
bootup if nfs-kernel-server happens to be present on the ISO.
Change-Id: I0012199658c186b69c45ac51bc249ce75b8d81ce
If the date of the running system is off by too much, then apt runs
might fail with something like:
| E: Release file for https://deb.sipwise.com/spce/mr10.5.2/dists/bullseye/InRelease is not valid yet (invalid for another 6h 19min 2s)
So let's try to sync the system's date/time via NTP. Given that chrony
is a small (only 650 kB of disk space) and secure replacement for ntp,
let's ship chrony with the Grml deployment ISO (and fall back to ntp
usage in the deployment script if chrony isn't available).
Also, if the system is configured to read the RTC time in the local
time zone, that is another known source of problems, so let's make sure
to use the RTC in UTC.
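A sketch of the intended logic (the NTP server name is illustrative; the actual invocations in the script may differ):

  # one-shot time sync: prefer chrony, fall back to ntp tooling
  if command -v chronyd >/dev/null 2>&1 ; then
    chronyd -q 'pool pool.ntp.org iburst'
  elif command -v ntpdate >/dev/null 2>&1 ; then
    ntpdate pool.ntp.org
  fi
  # make sure the RTC is interpreted as UTC, not local time
  timedatectl set-local-rtc 0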
Change-Id: I747665d1cee3b6f835c62812157d0203bcfa96e2
For deploying Debian/bookworm (see MT#55524), we'd like to have an
updated Grml ISO. With such a Debian/bookworm based live system, we can
still deploy older target systems (like Debian/bullseye).
Relevant changes:
1) Add jo as a new build-dependency, to generate build information in
conf/buildinfo.json (a new dependency of grml-live)
2) Always include ca-certificates, as this is required with more recent
mmdebstrap versions (>=0.8.0), when using apt repositories with
https, otherwise bootstrapping Debian fails.
3) Update to latest stable grml-live version v0.42.0, which:
a) added support for "bookworm" as suite name
cff66073a7
b) provides corresponding templates for memtest support:
c01a86b3fc
c) and a workaround for a kmod/initramfs-tools issue with PXE/NFS boot:
ea1e5ea330
4) Update memtest86+ to v6.00-1 as present in Debian/bookworm and
add corresponding UEFI support (based on grml-live's upstream change,
though as we don't support i386, dropped the 32bit related bits)
Change-Id: I327c0e25c28f46e097212ef4329d75fc8d34767c
We build the pre-loaded library targeting a specific Debian release,
which might be different from (and newer than) the release Grml was
built for. This can cause missing versioned symbols (and a loading
failure) if the libc in the outer system is older than in the inner
system.
Change-Id: I84f4f307863e534fe0fff85274ae1d5db809012c
Git commit 6661b04af0 broke all our bullseye-based builds
(debian, sipwise + docker), see
https://jenkins.mgm.sipwise.com/view/All/job/daily-build-matrix-debian-boxes/
For plain Debian installations we don't have SP_VERSION available,
so default to what was used before supporting trunk-weekly next
to trunk.
Change-Id: I61958f0c67d165d2f6dcb059fe4991ed24a328c9