We install the qemu-guest-agent package in ensure_packages_installed().
Therefore, try to start the qemu-guest-agent service only afterwards.
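A minimal sketch of the intended ordering, assuming the function and service names quoted above (the surrounding error handling in the script may differ):

  # install qemu-guest-agent (and friends) first ...
  ensure_packages_installed
  # ... and only then try to start the service
  systemctl start qemu-guest-agent || echo "Failed to start qemu-guest-agent, continuing anyway" >&2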
Fixup for commit 82e6638b40
Change-Id: Ic4aa2e493851b4c92ac134d68a9a76e05485658d
(cherry picked from commit cf94193f88)
(cherry picked from commit 6d3c733314)
/dev/virtio-ports/org.qemu.guest_agent.0 usually is a symlink to the
character device /dev/vport1p1. So adjust the device check accordingly
and only verify that it exists, but don't expect any particular file type.
This actually matches the behavior we also have in ngcp-installer.
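A hedged sketch of the relaxed check (the variable name is illustrative; the actual test in the script may look different):

  GUEST_AGENT_DEV='/dev/virtio-ports/org.qemu.guest_agent.0'
  # only verify existence; the path usually is a symlink to e.g. /dev/vport1p1,
  # so don't insist on a specific file type like character device or socket
  if [ -e "${GUEST_AGENT_DEV}" ] ; then
    systemctl start qemu-guest-agent
  fi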
Fixup for commit 82e6638b40
Change-Id: I0aa93c1f0e1086847eb7ed6967692a52e183bdc3
(cherry picked from commit 4a292ab4be)
(cherry picked from commit 2e674fe092)
Now that we enabled the QEMU Guest Agent option for our PVE VMs, we need
to have qemu-guest-agent present and active. Otherwise the VMs might
fail to shut down, like with our debian/sipwise/docker Debian systems
which are created via
https://jenkins.mgm.sipwise.com/job/daily-build-matrix-debian-boxes/:
| [proxmox-vm-shutdown] $ /bin/sh -e /tmp/env-proxmox-vm-shutdown7956268380939677154.sh
| [environment-script] Adding variable 'vm1reset' with value 'NO'
| [environment-script] Adding variable 'vm2' with value 'none'
| [environment-script] Adding variable 'vm1' with value 'none'
| [environment-script] Adding variable 'vm2reset' with value 'NO'
| [proxmox-vm-shutdown] $ /bin/bash /tmp/jenkins14192704603218787414.sh
| Using safe VM 'shutdown' for modern releases (mr6.5+). Executing action 'shutdown'...
| Shutting down VM 106
| Build timed out (after 10 minutes). Marking the build as aborted.
| Build was aborted
| [WS-CLEANUP] Deleting project workspace...
Let's make sure qemu-guest-agent is available in our Grml live system.
We added qemu-guest-agent to the package list of our Grml Sipwise ISO
(see git rev 65c3fea4c), but to ensure we don't strictly depend on this
brand new Grml Sipwise ISO yet, make sure to install it on-the-fly if
not yet present (like we already did for git, augeas-tools + gdisk).
Also make sure the qemu-guest-agent service is enabled if the socket
/dev/virtio-ports/org.qemu.guest_agent.0 is present (indicating that the
agent feature is enabled on VM level).
Furthermore, ensure qemu-guest-agent is present also in the installed
Debian system. Otherwise, when rebooting the VM once it's no longer
running the Grml live system but the installed Debian system, it might
also fail to shut down. So add it to the default list of packages for
bootstrapping.
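A rough sketch of the idea, assuming apt-get based on-the-fly installation as we already do for git, augeas-tools + gdisk (the exact commands in the script may differ):

  # install qemu-guest-agent on-the-fly if the ISO doesn't ship it yet
  if ! dpkg -l qemu-guest-agent 2>/dev/null | grep -q '^ii' ; then
    apt-get -y install qemu-guest-agent
  fi
  # enable the service only if the agent device is present (i.e. the QEMU
  # Guest Agent option is enabled on VM level)
  if [ -e /dev/virtio-ports/org.qemu.guest_agent.0 ] ; then
    systemctl enable qemu-guest-agent
  fi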
Change-Id: Id6adac55a47cfaed542cad2f9ac9740783e6d924
(cherry picked from commit 82e6638b40)
(cherry picked from commit b00792606c)
With this variable we had some special handling in
ngcp-initial-configuration when the Pro sp2 node is installed via
iPXE/cm image. Now we support installation of sp2 via iPXE only, so
there is no need to pass this variable.
But we need to keep the parent ngcppxeinstall parameter, as we need
this information for netcardconfig.
Change-Id: I20491289917cbb427ad6f5670f108c632838be71
We are dropping the scenario where the sp2 node is installed from a CD
image, so remove the corresponding part of the code.
Change-Id: Idced6b43a21add903dca070aa68f84b77acba28e
The code trying to fetch the OpenPGP certificate from a keyserver has
been non-functional for a while as the GPG_KEY_SERVER variable was
removed in commit 316c28bcc2. Instead of
restoring the variable with an up-to-date keyserver (not part of the
SKS pool, as that network is dead), we remove the support entirely, as
it's a potential security issue, e.g. due to fingerprint collisions.
As a side effect this removes apt-key usage which has been deprecated
upstream and is slated for removal.
Change-Id: I63171a66201c631da9233d54579bd1601ff22e3e
Fresh deployments with SW-RAID (Software-RAID) might fail if the present
disks were already part of an SW-RAID setup:
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
We could also reproduce this inside PVE VMs:
| mdadm: /dev/md/127 has been started with 2 drives.
| Error: disk sda seems to be part of an existing SW-RAID setup.
This is caused by the following behavior:
| + SWRAID_DEVICE="/dev/md0"
| [...]
| + mdadm --assemble --scan
| + true
| + [[ -b /dev/md0 ]]
| + for disk in "${SWRAID_DISK1}" "${SWRAID_DISK2}"
| + grep -q nvme1n1 /proc/mdstat
| + die 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| + echo 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
By default we expect and set the SWRAID_DEVICE to be /dev/md0. But only
"local" arrays get assembled as /dev/md0 and upwards, whereas "foreign"
arrays start at md127 downwards. This is exactly what we get when
booting our deployment live system on top of an existing installation
and assembling existing SW-RAIDs (to avoid overwriting unexpected disks
by mistake):
| root@grml ~ # lsblk
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
| loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs
| /run/live/rootfs/ngcp.squashfs
| nvme0n1 259:0 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
| nvme1n1 259:7 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
|
| root@grml ~ # lsblk -l -n -o TYPE,NAME
| loop loop0
| raid1 md127
| disk nvme0n1
| disk nvme1n1
| part md127p1
| part md127p2
| part md127p3
| part md127p4
| part md127p5
| part md127p6
|
| root@grml ~ # cat /proc/cmdline
| vmlinuz initrd=initrd.img swraiddestroy swraiddisk2=nvme0n1 swraiddisk1=nvme1n1 [...]
Let's identify existing RAID devices and check their configuration by
going through the disks and comparing them with our SWRAID_DISK1 and
SWRAID_DISK2. If they don't match, we stop execution to prevent any
possible data damage.
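A simplified sketch of that check (using the lsblk invocation shown below; the actual implementation may differ):

  # identify active SW-RAID arrays and make sure their components match the
  # disks we are supposed to use, otherwise stop to prevent data damage
  for raiddev in $(lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}') ; do
    if ! mdadm --detail "/dev/${raiddev}" | grep -qE "${SWRAID_DISK1}|${SWRAID_DISK2}" ; then
      die "Error: existing RAID device /dev/${raiddev} doesn't match disks ${SWRAID_DISK1} + ${SWRAID_DISK2}"
    fi
  done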
Furthermore, we need to assemble the mdadm array without relying on a
possibly existing local `/etc/mdadm/mdadm.conf` configuration file.
Otherwise assembling might fail:
| root@grml ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| unused devices: <none>
| root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}'
| root@grml ~ # grep ARRAY /etc/mdadm/mdadm.conf
| ARRAY /dev/md/127 metadata=1.0 UUID=0d44774e:7269bac6:2f02f337:4551597b name=localhost:127
| root@grml ~ # mdadm --assemble --scan
| 2 root@grml ~ # mdadm --assemble --scan --verbose
| mdadm: looking for devices for /dev/md/127
| mdadm: No super block found on /dev/loop0 (Expected magic a92b4efc, got 800989c0)
| mdadm: no RAID superblock on /dev/loop0
| mdadm: No super block found on /dev/nvme1n1p3 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1p3
| mdadm: No super block found on /dev/nvme1n1p2 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1p2
| mdadm: No super block found on /dev/nvme1n1p1 (Expected magic a92b4efc, got 000080fe)
| mdadm: no RAID superblock on /dev/nvme1n1p1
| mdadm: No super block found on /dev/nvme1n1 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme1n1
| mdadm: No super block found on /dev/nvme0n1p3 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1p3
| mdadm: No super block found on /dev/nvme0n1p2 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1p2
| mdadm: No super block found on /dev/nvme0n1p1 (Expected magic a92b4efc, got 000080fe)
| mdadm: no RAID superblock on /dev/nvme0n1p1
| mdadm: No super block found on /dev/nvme0n1 (Expected magic a92b4efc, got 00000000)
| mdadm: no RAID superblock on /dev/nvme0n1
| 2 root@grml ~ # mdadm --assemble --scan --config /dev/null
| mdadm: /dev/md/grml:127 has been started with 2 drives.
| root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}'
| md127
By running mdadm assemble with `--config /dev/null`, we prevent
consideration and usage of a possibly existing /etc/mdadm/mdadm.conf
configuration file.
Example output of running the new code:
| [...]
| mdadm: No arrays found in config file or automatically
| NOTE: default SWRAID_DEVICE set to /dev/md0 though we identified active md127
| NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup
| Wiping signatures from /dev/md127
| /dev/md127: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
| Removing mdadm device /dev/md127
| Stopping mdadm device /dev/md127
| mdadm: stopped /dev/md127
| Zero-ing superblock from /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme1n1
| Zero-ing superblock from /dev/nvme0n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| NOTE: modified RAID array detected, setting SWRAID_DEVICE back to original setting '/dev/md0'
| Removing possibly existing LVM/PV label from /dev/nvme1n1
| Cannot use /dev/nvme1n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme1n1p1
| Cannot use /dev/nvme1n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme1n1p2
| Labels on physical volume "/dev/nvme1n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3
| Cannot use /dev/nvme1n1p3: device is an md component
| Wiping disk signatures from /dev/nvme1n1
| /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme1n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0027866 s, 376 MB/s
| Removing possibly existing LVM/PV label from /dev/nvme0n1
| Cannot use /dev/nvme0n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme0n1p1
| Cannot use /dev/nvme0n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme0n1p2
| Labels on physical volume "/dev/nvme0n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme0n1p3
| Cannot use /dev/nvme0n1p3: device is an md component
| Wiping disk signatures from /dev/nvme0n1
| /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme0n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00278955 s, 376 MB/s
| Creating partition table
| Get path of EFI partition
| pvdevice is now available: /dev/nvme1n1p2
| The operation has completed successfully.
| The operation has completed successfully.
| pvdevice is now available: /dev/nvme1n1p3
| pvdevice is now available: /dev/nvme0n1p3
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024
| mdadm: Note: this array has metadata at the start and
| may not be suitable as a boot device. If you plan to
| store '/boot' on this device please ensure that
| your boot-loader understands md/v1.x metadata, or use
| --metadata=0.90
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Physical volume "/dev/md0" successfully created.
| Volume group "ngcp" successfully created
| 0 logical volume(s) in volume group "ngcp" now active
| Creating LV 'root' with 10G
| [...]
|
| mdadm: stopped /dev/md127
| mdadm: No arrays found in config file or automatically
| NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup
| Removing mdadm device /dev/md127
| Stopping mdadm device /dev/md127
| mdadm: stopped /dev/md127
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| mdadm: Note: this array has metadata at the start and
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| lvm2 mdadm wget
| Get:1 http://http-proxy.lab.sipwise.com/debian bookworm/main amd64 mdadm amd64 4.2-5 [443 kB]
| Selecting previously unselected package mdadm.
| Preparing to unpack .../0-mdadm_4.2-5_amd64.deb ...
| Unpacking mdadm (4.2-5) ...
| Setting up mdadm (4.2-5) ...
| [...]
| mdadm: stopped /dev/md0
Change-Id: Ib5875248e9c01dd4251bfab2cc4c94daace503fa
Deploying current NGCP trunk on an NVMe-powered SW-RAID setup failed with:
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Cannot use /dev/md0: device is partitioned
This is caused by /dev/md0 still containing partition data, while its
component nvme1n1p3 also still carries a linux_raid_member disk
signature.
So it's *not* enough to stop the mdadm array, remove PV/LVM information
from the partitions and finally wipe SW-RAID disks /dev/nvme1n1 +
/dev/nvme0n1 (example output from such a failing run):
| mdadm: /dev/md/0 has been started with 2 drives.
| mdadm: stopped /dev/md0
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| Removing possibly existing LVM/PV label from /dev/nvme1n1
| Cannot use /dev/nvme1n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme1n1p1
| Cannot use /dev/nvme1n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme1n1p2
| Labels on physical volume "/dev/nvme1n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3
| Cannot use /dev/nvme1n1p3: device is an md component
| Wiping disk signatures from /dev/nvme1n1
| /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme1n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00314195 s, 334 MB/s
| Removing possibly existing LVM/PV label from /dev/nvme0n1
| Cannot use /dev/nvme0n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme0n1p1
| Cannot use /dev/nvme0n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme0n1p2
| Labels on physical volume "/dev/nvme0n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme0n1p3
| Cannot use /dev/nvme0n1p3: device is an md component
| Wiping disk signatures from /dev/nvme0n1
| /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme0n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00893285 s, 117 MB/s
| Creating partition table
| Get path of EFI partition
| pvdevice is now available: /dev/nvme1n1p2
| The operation has completed successfully.
| The operation has completed successfully.
| pvdevice is now available: /dev/nvme1n1p3
| pvdevice is now available: /dev/nvme0n1p3
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023
| mdadm: Note: this array has metadata at the start and
| may not be suitable as a boot device. If you plan to
| store '/boot' on this device please ensure that
| your boot-loader understands md/v1.x metadata, or use
| --metadata=0.90
| mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Cannot use /dev/md0: device is partitioned
Instead, we also need to wipe the signatures from the SW-RAID device
(like /dev/md0), only then stop it, then wipe the disk signatures from
all the partitions (like /dev/nvme1n1p3), and only then finally remove
the disk signatures from the main block device (like /dev/nvme1n1).
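A rough sketch of the new ordering (device names as in the example above; the actual code loops over the configured SW-RAID disks and their partitions):

  # 1) wipe signatures from the assembled SW-RAID device itself
  wipefs --all /dev/md0
  # 2) only then stop the array
  mdadm --stop /dev/md0
  # 3) wipe signatures from all partitions of each disk ...
  wipefs --all /dev/nvme1n1p1 /dev/nvme1n1p2 /dev/nvme1n1p3
  # 4) ... and only then from the main block device itself
  wipefs --all /dev/nvme1n1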
Example from a successful run with this change:
| root@grml ~ # grep -e mdadm -e Wiping /tmp/deployment-installer-debug.log
| mdadm: /dev/md/0 has been started with 2 drives.
| Wiping signatures from /dev/md0
| Removing mdadm device /dev/md0
| Stopping mdadm device /dev/md0
| mdadm: stopped /dev/md0
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| Wiping disk signatures from partition /dev/nvme1n1p1
| Wiping disk signatures from partition /dev/nvme1n1p2
| Wiping disk signatures from partition /dev/nvme1n1p3
| Wiping disk signatures from /dev/nvme1n1
| Wiping disk signatures from partition /dev/nvme0n1p1
| Wiping disk signatures from partition /dev/nvme0n1p2
| Wiping disk signatures from partition /dev/nvme0n1p3
| Wiping disk signatures from /dev/nvme0n1
| mdadm: Note: this array has metadata at the start and
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Wiping ext3 signature on /dev/ngcp/root.
| Wiping ext4 signature on /dev/ngcp/fallback.
| Wiping ext4 signature on /dev/ngcp/data.
While at it, be more verbose about the executed steps.
FTR, disk and setup information of such a system where we noticed the
failure and worked on this change:
| root@grml ~ # fdisk -l
| Disk /dev/nvme0n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors
| Disk model: DELL NVME ISE PE8010 RI M.2 480GB
| Units: sectors of 1 * 512 = 512 bytes
| Sector size (logical/physical): 512 bytes / 512 bytes
| I/O size (minimum/optimal): 512 bytes / 512 bytes
| Disklabel type: gpt
| Disk identifier: 5D296676-52CF-49CF-863A-6D3A3BD0604F
|
| Device Start End Sectors Size Type
| /dev/nvme0n1p1 2048 4095 2048 1M BIOS boot
| /dev/nvme0n1p2 4096 999423 995328 486M EFI System
| /dev/nvme0n1p3 999424 937701375 936701952 446.7G Linux RAID
|
|
| Disk /dev/nvme1n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors
| Disk model: DELL NVME ISE PE8010 RI M.2 480GB
| Units: sectors of 1 * 512 = 512 bytes
| Sector size (logical/physical): 512 bytes / 512 bytes
| I/O size (minimum/optimal): 512 bytes / 512 bytes
| Disklabel type: gpt
| Disk identifier: 9AFA8ACF-D2CD-4224-BA0C-D38A6581D0F9
|
| Device Start End Sectors Size Type
| /dev/nvme1n1p1 2048 4095 2048 1M BIOS boot
| /dev/nvme1n1p2 4096 999423 995328 486M EFI System
| /dev/nvme1n1p3 999424 937701375 936701952 446.7G Linux RAID
| [...]
|
| root@grml ~ # lsblk
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
| loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs
| /run/live/rootfs/ngcp.squashfs
| nvme0n1 259:0 0 447.1G 0 disk
| ├─nvme0n1p1 259:5 0 1M 0 part
| ├─nvme0n1p2 259:8 0 486M 0 part
| └─nvme0n1p3 259:9 0 446.7G 0 part
| └─md0 9:0 0 446.5G 0 raid1
| ├─ngcp-root 253:0 0 10G 0 lvm /mnt
| ├─ngcp-fallback 253:1 0 10G 0 lvm
| └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data
| nvme1n1 259:4 0 447.1G 0 disk
| ├─nvme1n1p1 259:2 0 1M 0 part
| ├─nvme1n1p2 259:6 0 486M 0 part
| └─nvme1n1p3 259:7 0 446.7G 0 part
| └─md0 9:0 0 446.5G 0 raid1
| ├─ngcp-root 253:0 0 10G 0 lvm /mnt
| ├─ngcp-fallback 253:1 0 10G 0 lvm
| └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data
|
| root@grml ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| md0 : active raid1 nvme0n1p3[1] nvme1n1p3[0]
| 468218880 blocks super 1.2 [2/2] [UU]
| [==>..................] resync = 12.7% (59516864/468218880) finish=33.1min speed=205685K/sec
| bitmap: 4/4 pages [16KB], 65536KB chunk
|
| unused devices: <none>
Change-Id: Iaa7f49eef11ef6ad6209fe962bb8940a75a87c95
We get the following error message in /var/log/vboxadd-install.log,
/var/log/deployment-installer-debug.log, /var/log/daemon.log +
/var/log/syslog:
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
This is caused by missing libxmu6:
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
| [sipwise-lab-trunk] sipwise@spce:~$ sudo apt install libxmu6
| Reading package lists... Done
| Building dependency tree... Done
| Reading state information... Done
| The following NEW packages will be installed:
| libxmu6
| 0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
| Need to get 60.1 kB of archives.
| After this operation, 143 kB of additional disk space will be used.
| Get:1 https://debian.sipwise.com/debian bookworm/main amd64 libxmu6 amd64 2:1.1.3-3 [60.1 kB]
| Fetched 60.1 kB in 0s (199 kB/s)
| [...]
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| Oracle VM VirtualBox VBoxClient 7.0.6
| Copyright (C) 2005-2023 Oracle and/or its affiliates
|
| Usage: VBoxClient --clipboard|--draganddrop|--checkhostversion|--seamless|--vmsvga|--vmsvga-session
| [-d|--nodaemon]
|
| Options:
| [...]
It looks like lack of libxmu6 doesn't cause any actual problems for our
use case (we don't use X.org at all), though given that libxmu6 is a
small library package, let's try to get it working as expected and avoid
the alarming errors in the logs.
Thanks Guillem Jover for spotting and reporting
Change-Id: I65f3dd496a4026f04fd9944fd7cc43d6abbdf336
During initial deployment of a system, we get warnings about
the lack of zstd:
| Setting up linux-image-6.1.0-13-amd64 (6.1.55-1) ...
| I: /vmlinuz.old is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img.old is now a symlink to boot/initrd.img-6.1.0-13-amd64
| I: /vmlinuz is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img is now a symlink to boot/initrd.img-6.1.0-13-amd64
| /etc/kernel/postinst.d/initramfs-tools:
| update-initramfs: Generating /boot/initrd.img-6.1.0-13-amd64
| W: No zstd in /usr/bin:/sbin:/bin, using gzip
| [...]
The initramfs generation and update runs *four* times overall within
the initial bootstrapping of a system (we'll try to do something about
this, but it's outside the scope of this change).
As of v0.141, initramfs-tools uses zstd as the default compression for
the initramfs. Version 0.142 is shipped with Debian/bookworm, and
therefore it makes sense to have zstd available upfront. Note that the
initrd generation is also faster with zstd (~10s for zstd vs. ~13s for
gzip) and the resulting initrd is smaller (~33MB for zstd vs. ~39MB for
gzip).
By making sure that zstd is available right from the very beginning,
before ngcp-installer pulls it in later, we not only avoid the warning
message but also save >10 seconds of install time.
Given that zstd is available even in Debian oldoldstable, let's install
it unconditionally in all our systems.
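A minimal sketch of the idea, with an illustrative package list variable (the actual variable/function in the deployment code may differ):

  # zstd is tiny, available since Debian oldoldstable and the default
  # initramfs compressor as of initramfs-tools >= 0.141, so always pull it in
  DEBIAN_PACKAGES="${DEBIAN_PACKAGES} zstd"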
Thanks: Volodymyr Fedorov for reporting
Change-Id: I56674c3c213f7c7a6e6cbce3c8e2e00a4cfbdbd4
Even though ntpsec.service contains an Alias for ntp.service, that
alias does not work for us while the service has not yet been installed,
so the first run fails. Use the actual service name to avoid this issue.
Change-Id: I8f0ee3b38390a7e58c3bbee65fd96bfd4b717dfa
It's better to have this package in the grml-sipwise image, so any
system with this network card can use its full capabilities already
during the deployment stage.
Change-Id: I765efcf446a410a42ef156b2ccc2e6612a33ddd6
Let's restore system state of /run/systemd/system for
VBoxLinuxAdditions, to avoid any unexpected side effects.
Followup for git rev 8601193
Change-Id: I632c7d60ebb627c3a80d4c1f9b264d6d0a13b4f1
Recent Grml ISOs, including our Grml-Sipwise ISO (v2023-06-01), include
grml-autoconfig v0.20.3, which executes the grml-autoconfig service with
`StandardInput=null`. This is necessary to avoid conflicting with tty
usage, e.g. with a serial console. See git rev
1e268ffe4f
Now that we run with /dev/null for stdin, we can't interact with the
user, so let's try to detect when running from within grml-autoconfig's
systemd unit, and if so assume that we're executing on /dev/tty1 and
use/reopen that for stdin.
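A hedged sketch of that logic (the exact way the script detects the grml-autoconfig unit may differ):

  # if stdin is /dev/null we're most likely running from grml-autoconfig's
  # systemd unit; assume the user is on /dev/tty1 and reopen it for stdin
  if [ "$(readlink -f /proc/$$/fd/0 2>/dev/null)" = "/dev/null" ] ; then
    exec < /dev/tty1
  fi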
Change-Id: Id55283c7f862487a6ef8acb8ab01f67a05bd8dd7
As of git rev 6c960afee4 we're using the
virtualbox-guest-additions-iso from bookworm.
Previous versions of VBoxGuestAdditions had a simple test to check for
the presence of systemd, quoting from
/opt/VBoxGuestAdditions-6.1.22/routines.sh:
| use_systemd()
| {
| test ! -f /sbin/init || test -L /sbin/init
| }
Now in more recent versions of VBoxGuestAdditions[1], the systemd check
was modified, quoting from /opt/VBoxGuestAdditions-7.0.6/routines.sh:
| use_systemd()
| {
| # First condition is what halfway recent systemd uses itself, and the
| # other two checks should cover everything back to v1.
| test -e /run/systemd/system || test -e /sys/fs/cgroup/systemd || test -e /cgroup/systemd
| }
So if we're running inside a chroot, as with our deployment.sh, the
system looks like a non-systemd system to the VBoxGuestAdditions
installer, and we end up with the installation and presence of
/etc/init.d/vboxadd, leading to:
| root@spce:~# ls -lah /run/systemd/generator.late/
| total 4.0K
| drwxr-xr-x 4 root root 100 Jul 18 00:20 .
| drwxr-xr-x 23 root root 580 Jul 18 00:20 ..
| drwxr-xr-x 2 root root 60 Jul 18 00:20 graphical.target.wants
| drwxr-xr-x 2 root root 60 Jul 18 00:20 multi-user.target.wants
| -rw-r--r-- 1 root root 537 Jul 18 00:20 vboxadd.service
|
| root@spce:~# systemctl cat vboxadd.service
| # /run/systemd/generator.late/vboxadd.service
| # Automatically generated by systemd-sysv-generator
|
| [Unit]
| Documentation=man:systemd-sysv-generator(8)
| SourcePath=/etc/init.d/vboxadd
| Description=LSB: VirtualBox Linux Additions kernel modules
| Before=multi-user.target
| Before=multi-user.target
| Before=multi-user.target
| Before=graphical.target
| Before=display-manager.service
|
| [Service]
| Type=forking
| Restart=no
| TimeoutSec=5min
| IgnoreSIGPIPE=no
| KillMode=process
| GuessMainPID=no
| RemainAfterExit=yes
| SuccessExitStatus=5 6
| ExecStart=/etc/init.d/vboxadd start
| ExecStop=/etc/init.d/vboxadd stop
We don't expect any init scripts to be present, as all our services
must have systemd unit files. Therefore we check for the absence of
systemd's /run/systemd/generator.late in our system-tests, which started
to fail with the upgrade to VBoxGuestAdditions v7.0.6 due to the systemd
presence detection mentioned above.
Let's fake the presence of systemd before invoking the
VBoxGuestAdditions installer, to avoid ending up with unexpected vbox*
init scripts.
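A sketch of that workaround (the actual installer invocation is left out; restoring the original state corresponds to the earlier entry referencing git rev 8601193):

  # make the chroot look like a systemd system for the use_systemd() check
  fake_systemd=false
  if ! [ -e /run/systemd/system ] ; then
    mkdir -p /run/systemd/system
    fake_systemd=true
  fi
  # ... invoke the VBoxGuestAdditions installer here ...
  if ${fake_systemd} ; then
    rmdir /run/systemd/system
  fi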
[1] See svn rev 92682:
https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Installer/linux/routines.sh?rev=92682
https://www.virtualbox.org/changeset?old=92681&old_path=vbox%2Ftrunk%2Fsrc%2FVBox%2FInstaller%2Flinux%2Froutines.sh&new=92682&new_path=vbox%2Ftrunk%2Fsrc%2FVBox%2FInstaller%2Flinux%2Froutines.sh
Change-Id: Ifd11460e3a8fd4f4c1269453a9b8376065861b8e
Support the bookworm option in the DEBIAN_RELEASE selection; we already
have support for it. Use bookworm as the fallback, since we have moved
to it by now.
Change-Id: I118c1b5cf81fe57394495b5f745fc81032406c78
To be able to upgrade our internal systems to Debian/bookworm
we need to have puppet packages available.
Upstream still doesn't provide any Debian packages
(see https://tickets.puppetlabs.com/browse/PA-4995),
though their AIO (All In One) packages for Debian/bullseye
seem to work on Debian/bookworm as well (at least for
puppet-agent). So until we either migrate to the puppet-agent
as present in Debian/bookworm or upstream provides corresponding
AIO packages, let's use the puppet-agent packages we already
use for our Debian/bullseye systems.
Change-Id: I2211ffd79f70a2a79873e737b0b512bfb7492328
Since version 1.20.0, dpkg no longer creates /var/lib/dpkg/available
(see #647911). Now that we upgraded our Grml-Sipwise deployment system
to bookworm, we have dpkg v1.21.22 on our live system, and mmdebstrap
relies on dpkg of the host system for execution.
But on Debian releases up to and including buster, dpkg fails to
operate with e.g. `dpkg --set-selections` if /var/lib/dpkg/available
doesn't exist:
| The following NEW packages will be installed:
| nullmailer
| [...]
| debconf: delaying package configuration, since apt-utils is not installed
| dpkg: error: failed to open package info file '/var/lib/dpkg/available' for reading: No such file or directory
We *could* switch from mmdebstrap to debootstrap for deploying Debian
releases <= buster, but this would be slower, and we have been using
mmdebstrap for everything for quite some time. So instead let's create
/var/lib/dpkg/available after bootstrapping the system.
Reported towards mmdebstrap as #1037946.
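A minimal sketch, assuming the bootstrapped system is mounted at "${TARGET}" (the variable name is illustrative):

  # dpkg >= 1.20.0 no longer creates this file, but dpkg in releases up to
  # buster needs it for e.g. `dpkg --set-selections`
  if ! [ -f "${TARGET}/var/lib/dpkg/available" ] ; then
    touch "${TARGET}/var/lib/dpkg/available"
  fi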
Change-Id: I0a87ca255d5eb7144a9c093051c0a6a3114a3c0b
Our deployment system is now based on Debian/bookworm, while our
gerrit/git server still runs on Debian/bullseye, so we run into the
OpenSSH RSA issue (RSA signatures using the SHA-1 hash algorithm are
disabled by default), see
https://michael-prokop.at/blog/2023/06/11/what-to-expect-from-debian-bookworm-newinbookworm/
and https://www.jhanley.com/blog/ssh-signature-algorithm-ssh-rsa-error/
We need to enable ssh-rsa usage, otherwise deployment fails with:
| Warning: Permanently added '[gerrit.mgm.sipwise.com]:29418' (ED25519) to the list of known hosts.
| sign_and_send_pubkey: no mutual signature supported
| puppet-r10k@gerrit.mgm.sipwise.com: Permission denied (publickey).
| fatal: Could not read from remote repository.
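A sketch of the kind of ssh configuration that re-enables this, restricted to the gerrit host (the file location used by the deployment script may differ):

  # re-allow RSA/SHA-1 signatures solely for the still bullseye-based gerrit
  printf 'Host gerrit.mgm.sipwise.com\n    PubkeyAcceptedAlgorithms +ssh-rsa\n' >> "${HOME}/.ssh/config"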
Change-Id: I5894170dab033d52a2612beea7b6f27ab06cc586
Deploying the Debian/bookworm based NGCP system fails on a Lenovo sr250
v2 node with an Intel E810 network card:
| # lshw -c net -businfo
| Bus info Device Class Description
| =======================================================
| pci@0000:01:00.0 eth0 network Ethernet Controller E810-XXV for SFP
| pci@0000:01:00.1 eth1 network Ethernet Controller E810-XXV for SFP
| # lshw -c net
| *-network:0
| description: Ethernet interface
| product: Ethernet Controller E810-XXV for SFP
| vendor: Intel Corporation
| physical id: 0
| bus info: pci@0000:01:00.0
| logical name: eth0
| version: 02
| serial: [...]
| size: 10Gbit/s
| capacity: 25Gbit/s
| width: 64 bits
| clock: 33MHz
| capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 1000bt-fd 25000bt-fd
| configuration: autonegotiation=off broadcast=yes driver=ice driverversion=1.11.14 duplex=full firmware=2.25 0x80007027 1.2934.0 ip=192.168.90.51 latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
| resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:4002000000-4003ffffff memory:4006010000-400601ffff memory:a1d00000-a1dfffff memory:4005000000-4005ffffff memory:4006220000-400641ffff
We set up the /etc/network/interfaces file by invoking Grml's
netcardconfig script in automated mode, like:
NET_DEV=eth0 METHOD=static IPADDR=192.168.90.51 NETMASK=255.255.255.248 GATEWAY=192.168.90.49 /usr/sbin/netcardconfig
The resulting /etc/network/interfaces is used as the base for the NGCP
chroot/target system. netcardconfig shuts down the network
interface (eth0 in the example above) via ifdown, then sleeps for 3
seconds and re-enables the interface (via ifup) with the new
configuration.
This used to work fine, but with the Intel E810 network card and
kernel version 6.1.0-9-amd64 from Debian/bookworm we see a link failure,
and it takes ~10 seconds until the network device is up and running
again. The following vagrant_configuration() execution from
deployment.sh then fails:
| +11:41:01 (netscript.grml:1022): vagrant_configuration(): wget -O /var/tmp/id_rsa_sipwise.pub http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| --2023-06-11 11:41:01-- http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| Resolving builder.mgm.sipwise.com (builder.mgm.sipwise.com)... failed: Name or service not known.
| wget: unable to resolve host address 'builder.mgm.sipwise.com'
However, when we retry just a bit later, the network works fine
again. During investigation we identified that the network card flips
the port; quoting the related log from the connected Cisco Nexus 5020
switch (with fast STP learning mode):
| nexus5k %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/33 is down (Link failure)
It seems to be related to some autonegotiation problem, as when we
execute `ethtool -A eth0 rx on tx on` (no matter whether with `on` or
`off`), we see:
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: Autoneg did not complete so changing settings may not result in an actual change.
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: NIC Link is Down
| [Tue Jun 13 08:51:45 2023] ice 0000:01:00.0 eth0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: Rx/Tx
FTR:
| root@sp1 ~ # ethtool -A eth0 autoneg off
| netlink error: Operation not supported
| 76 root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # ethtool -A eth0 autoneg on
| root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # dmesg -T | tail -1
| [Tue Jun 13 08:53:26 2023] ice 0000:01:00.0 eth0: To change autoneg please use: ethtool -s <dev> autoneg <on|off>
| root@sp1 ~ # ethtool -s eth0 autoneg off
| root@sp1 ~ # ethtool -s eth0 autoneg on
| netlink error: link settings update failed
| netlink error: Operation not supported
| 75 root@sp1 ~ #
As a workaround, at least until we have a better fix/solution, we try
to reach the default gateway (or fall back to the repository host if the
gateway couldn't be identified) via ICMP/ping, and once that works we
continue as usual. Even if that fails, we continue execution anyway, to
minimize the behavior change while still having a workaround for this
specific situation.
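A rough sketch of that workaround (gateway detection via `ip route`; the fallback variable name is illustrative):

  # wait until the default gateway (or the repository host as fallback)
  # answers ICMP, but don't fail hard if it never does
  target=$(ip -4 route show default | awk '{print $3; exit}')
  [ -n "${target}" ] || target="${SIPWISE_REPO_HOST}"
  for _ in $(seq 1 30) ; do
    ping -c 1 -W 1 "${target}" >/dev/null 2>&1 && break
    sleep 1
  done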
FTR, broken system:
| root@sp1 ~ # ethtool -i eth0
| driver: ice
| version: 6.1.0-9-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Whereas with kernel 5.10.0-23-amd64 from Debian/bullseye we don't seem
to see that behavior:
| root@sp1:~# ethtool -i neth0
| driver: ice
| version: 5.10.0-23-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Also using latest available ice v1.11.14 (from
https://sourceforge.net/projects/e1000/files/ice%20stable/1.11.14/)
on Kernel version 6.1.0-9-amd64 doesn't bring any change:
| root@sp1 ~ # modinfo ice
| filename: /lib/modules/6.1.0-9-amd64/updates/drivers/net/ethernet/intel/ice/ice.ko
| firmware: intel/ice/ddp/ice.pkg
| version: 1.11.14
| license: GPL v2
| description: Intel(R) Ethernet Connection E800 Series Linux Driver
| author: Intel Corporation, <linux.nics@intel.com>
| srcversion: 818E9C817731C98A25470C0
| alias: pci:v00008086d00001888sv*sd*bc*sc*i*
| [...]
| alias: pci:v00008086d00001591sv*sd*bc*sc*i*
| depends: ptp
| retpoline: Y
| name: ice
| vermagic: 6.1.0-9-amd64 SMP preempt mod_unload modversions
| parm: debug:netif level (0=none,...,16=all) (int)
| parm: fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
| (ushort)
| parm: fwlog_events:FW events to log (32-bit mask)
| (ulong)
| root@sp1 ~ # ethtool -i eth0 | head -3
| driver: ice
| version: 1.11.14
| firmware-version: 2.25 0x80007027 1.2934.0
| root@sp1 ~ #
Change-Id: Ieafe648be4e06ed0d936611ebaf8ee54266b6f3c
Re-reading of disks fails if the mdadm SW-RAID device is still active:
| root@sp1 ~ # cat /proc/mdstat
| Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
| md0 : active raid1 sdb3[1] sda3[0]
| 468218880 blocks super 1.2 [2/2] [UU]
| [========>............] resync = 42.2% (197855168/468218880) finish=22.4min speed=200756K/sec
| bitmap: 3/4 pages [12KB], 65536KB chunk
|
| unused devices: <none>
| root@sp1 ~ # blockdev --rereadpt /dev/sdb
| blockdev: ioctl error on BLKRRPART: Device or resource busy
| 1 root@sp1 ~ # blockdev --rereadpt /dev/sda
| blockdev: ioctl error on BLKRRPART: Device or resource busy
| 1 root@sp1 ~ #
Only once we stop the mdadm SW-RAID device can we re-read the
partition table:
| root@sp1 ~ # mdadm --stop /dev/md0
| mdadm: stopped /dev/md0
| root@sp1 ~ # blockdev --rereadpt /dev/sda
| root@sp1 ~ #
This behavior isn't new and is unrelated to Debian/bookworm, but it was
spotted while debugging an unrelated issue.
FTR: we re-read the partition table (via `blockdev --rereadpt`) to
ensure that /etc/fstab of the live system is up to date and matches the
current system state. While this isn't strictly needed, we preserve the
existing behavior and also try to avoid a hard "cut" of a possibly
ongoing SW-RAID sync.
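A condensed sketch of that order (the actual code operates on the configured SW-RAID disks):

  # stop any active SW-RAID arrays first, ...
  for md in $(awk '/^md/ {print $1}' /proc/mdstat) ; do
    mdadm --stop "/dev/${md}"
  done
  # ... only then re-reading the partition tables succeeds
  blockdev --rereadpt "/dev/${SWRAID_DISK1}" "/dev/${SWRAID_DISK2}"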
Change-Id: I735b00423e6efa932f74b78a38ed023576e5d306
With our newer Grml-Sipwise ISO (v2023-06-01) being based on
Debian/bookworm and recent Grml packages, our automated deployment
suddenly started to fail for us:
| +04:28:12 (netscript.grml:2453): echo 'Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds]'
| ++04:28:12 (netscript.grml:2455): get_deploy_status
| ++04:28:12 (netscript.grml:95): get_deploy_status(): '[' -r /srv/deployment//status ']'
| ++04:28:12 (netscript.grml:96): get_deploy_status(): cat /srv/deployment//status
| Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds]
| +04:28:12 (netscript.grml:2455): '[' copylogfiles '!=' error ']'
| +04:28:12 (netscript.grml:2456): set_deploy_status finished
| +04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n finished ']'
| +04:28:12 (netscript.grml:104): set_deploy_status(): echo finished
| +04:28:12 (netscript.grml:2459): false
| +04:28:12 (netscript.grml:2463): status_wait
| +04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]]
| +04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]]
| +04:28:12 (netscript.grml:2466): false
| +04:28:12 (netscript.grml:2471): false
| +04:28:12 (netscript.grml:2476): echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)'
| Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)
| +04:28:12 (netscript.grml:2477): unset a
| +04:28:12 (netscript.grml:2478): read -r a
| ++04:28:12 (netscript.grml:2478): wait_exit
| ++04:28:12 (netscript.grml:339): wait_exit(): local e_code=1
| ++04:28:12 (netscript.grml:340): wait_exit(): [[ 1 -ne 0 ]]
| ++04:28:12 (netscript.grml:341): wait_exit(): set_deploy_status error
| ++04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n error ']'
| ++04:28:12 (netscript.grml:104): set_deploy_status(): echo error
| ++04:28:12 (netscript.grml:343): wait_exit(): trap '' 1 2 3 6 15 ERR EXIT
| ++04:28:12 (netscript.grml:344): wait_exit(): status_wait
| ++04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]]
| ++04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]]
| ++04:28:12 (netscript.grml:345): wait_exit(): exit 1
As of grml-autoconfig v0.20.3 and newer, the grml-autoconfig systemd service
that invokes the deployment netscript uses `StandardInput=null` instead of
`StandardInput=tty` (see https://github.com/grml/grml/issues/176).
Due to this, a logic error in our deployment script surfaced: we exit
the script in interactive mode, but only *afterwards* prompt for
reboot/halt with `read -r a` - which of course fails if stdin is
missing. As a result, we end up in our signal handler `trap 'wait_exit;'
1 2 3 6 15 ERR EXIT` and then fail the deployment.
So instead prompt for "Do you want to [r]eboot or [h]alt ..." *only* in
interactive mode, and while at it drop the "if "$INTERACTIVE" ; then
exit 0 ; fi" so the prompt is actually presented to the user.
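A condensed sketch of the resulting logic (prompt text and variable as quoted above; the actual handling of the answer stays as before):

  # prompt only in interactive mode; with StandardInput=null there is no
  # stdin to read the answer from anyway
  if "${INTERACTIVE}" ; then
    echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)'
    unset a
    read -r a
    # ... handle "$a" (reboot/halt/cancel) as before ...
  fi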
Change-Id: Ia89beaf3c446f3701cc30ab21cfdff7b5808a6d3
Manual execution of python's http.server has multiple drawbacks, like no
proper logging and no service tracking/restart options, but most notably
the deployment status server no longer runs when our deployment script
fails.
While /srv/deployment/status then still might contain "error", no one is
serving that information on port 4242 any longer[1], and our
daily-build-install-vm Jenkins job might then report:
| VM '192.168.209.162' current state is '' - retrying up to another 1646 times, sleeping for a second
| VM '192.168.209.162' current state is '' - retrying up to another 1645 times, sleeping for a second
| [...]
It then runs for about half an hour without doing anything useful,
until the Jenkins job itself gives up.
By running our deployment status server under systemd, we keep the
service alive also when the deployment script terminates. In case of
errors we get immediate feedback:
| VM '192.168.209.162' current state is 'puppet' - retrying up to another 1648 times, sleeping for a second
| VM '192.168.209.162' current state is 'puppet' - retrying up to another 1647 times, sleeping for a second
| VM '192.168.209.162' current state is 'error' - retrying up to another 1646 times, sleeping for a second
| + '[' error '!=' finished ']'
| + echo 'Failed to install Proxom VM '\''162'\'' (IP '\''192.168.209.162'\'')'
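A sketch of what such a unit could look like (unit name and exact options are illustrative; the actual unit shipped by the script may differ):

  # /run/systemd/system/deployment-status-server.service (illustrative path)
  [Unit]
  Description=Deployment status server

  [Service]
  WorkingDirectory=/srv/deployment
  ExecStart=/usr/bin/python3 -m http.server 4242
  Restart=on-failure

  [Install]
  WantedBy=multi-user.target

The script would then just run `systemctl daemon-reload` followed by `systemctl start deployment-status-server.service`.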
[1] For our NGCP based installations we use the ngcpstatus boot option,
where its status_wait trap kicks in and avoids premature exit of the
deployment status server. But e.g. our non-NGCP systems don't use that
boot option, and with this change we could get rid of status_wait
altogether.
Change-Id: Ibaa799358caedf31c64c37b48e3c5e889808086a
Packages like 'firmware-linux', 'firmware-linux-nonfree',
'firmware-misc-nonfree' and further 'firmware-*' got moved from non-free
to the new non-free-firmware component/repository (related to
https://www.debian.org/vote/2022/vote_003).
grml-live v0.43.0 provides support for this new component, so let's
make sure we have proper support for firmware-related packages by
updating to the corresponding grml-live version.
Change-Id: I4704e8be051ab6b5496021f07f42208b34963739
Use system-tools' ngcp-initialize-udev-rules-net script to
deploy /etc/udev/rules.d/70-persistent-net.rules, so there is no need
to maintain the code in multiple places.
Change-Id: I81925262a8c687aa9976cbc1113568989fa53281
When building our Debian boxes for buster, bullseye + bookworm (via the
daily-build-matrix-debian-boxes Jenkins job), we get broken networking,
so e.g. `vagrant up debian-bookworm` doesn't work.
This is caused by /etc/network/interfaces (using e.g. "neth0", the
naming scheme we use in NGCP, as adjusted by the deployment script) not
matching the actual system network devices (like enp0s3).
TL;DR: no behavior change for NGCP systems; only when building non-NGCP
systems do we enable net.ifnames=0 (via set_custom_grub_boot_options),
but we do *not* generate /etc/udev/rules.d/70-persistent-net.rules (via
generate_udev_network_rules), nor rename eth* -> neth* in
/etc/network/interfaces.
More verbose version:
* rename the "eth*" networking interfaces into "neth*" in
/etc/network/interfaces only when running in ngcp-installer mode
(this is the behavior we rely on in NGCP, but it doesn't matter
for plain Debian systems)
* generate /etc/udev/rules.d/70-persistent-net.rules only when running
in ngcp-installer mode. While our jenkins-configs.git's
jobs/daily-build/scripts/vm_clean-fs.sh removes the file anyway (for
the VM use case), between the initial deployment run and the next reboot
the configuration inside the PVE VM still applies, so we end up with
an existing /etc/udev/rules.d/70-persistent-net.rules, referring to
neth0, while our /etc/network/interfaces configures eth0 instead.
* when *not* running in ngcp-installer mode, enable net.ifnames=0 usage
in GRUB to disable persistent network interface naming. FTR, this
change is *not* needed for NGCP, as on NGCP systems we use
/etc/udev/rules.d/70-persistent-net.rules, generated by
ngcp-system-tools' ngcp-initialize-udev-rules-net script also in the VM
use case
This is a fixup for a change in git commit a50903a30c (see also commit
message of git commit ab62171), which should have been adjusted for
ngcp-installer-only mode instead.
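A condensed sketch of the resulting conditional (NGCP_INSTALLER is an illustrative flag name for ngcp-installer mode; function names as referenced above):

  if "${NGCP_INSTALLER}" ; then
    # NGCP: keep existing behavior, i.e. rename eth* -> neth* in
    # /etc/network/interfaces and generate the persistent-net rules
    generate_udev_network_rules
  else
    # plain Debian: stick to the traditional kernel interface names instead
    set_custom_grub_boot_options   # enables net.ifnames=0
  fi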
Change-Id: I6d0021dbdc2c1587127f0e115c6ff9844460a761
The public name servers resolve deb.sipwise.com to our public OVH IP
address 164.132.119.186, while internally we want to use its cname
haproxy.mgm.sipwise.com. This only works when using our internal
nameservers (like 192.168.212.30 and 192.168.88.20).
Default to 192.168.212.30, so deployments work as expected, otherwise
we're failing during deployment with:
| Err:5 https://deb.sipwise.com/autobuild release-trunk-bookworm InRelease
| 403 Forbidden [IP: 164.132.119.186 443]
While at it, also update the ip=... kernel option to use
168.192.91.XX/24 by default, and use an FQDN for the hostname (since
that's our current policy for puppet hostnames/certificates).
Change-Id: I1ce6541f7a31baa437e679b67056bb7851b1b33d
Relevant changes:
* GRMLBASE/39-modprobe: avoid usage of /lib/modprobe.d/50-nfs.conf
* GRMLBASE/39-modprobe: do not expect all files in /etc/modprobe.d to be used
This gives us working netboot images and avoids sysctl errors during
bootup if nfs-kernel-server happens to be present on the ISO.
Change-Id: I0012199658c186b69c45ac51bc249ce75b8d81ce
If the date of the running system is off by too much, then apt runs
might fail with something like:
| E: Release file for https://deb.sipwise.com/spce/mr10.5.2/dists/bullseye/InRelease is not valid yet (invalid for another 6h 19min 2s)
So let's try to sync the system's date/time via NTP. Given that chrony
is a small (only 650 kB of disk space) and secure replacement for ntp,
let's ship chrony with the Grml deployment ISO (and fall back to ntp
usage in the deployment script if chrony isn't available).
Also, if the system is configured to read the RTC time in the local
time zone, that is another known source of problems, so let's make sure
to use the RTC in UTC.
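A sketch of the intended logic (the NTP server name is illustrative; the actual invocations in the script may differ):

  # one-shot time sync: prefer chrony, fall back to ntp tooling
  if command -v chronyd >/dev/null 2>&1 ; then
    chronyd -q 'pool pool.ntp.org iburst'
  elif command -v ntpdate >/dev/null 2>&1 ; then
    ntpdate pool.ntp.org
  fi
  # make sure the RTC is interpreted as UTC, not local time
  timedatectl set-local-rtc 0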
Change-Id: I747665d1cee3b6f835c62812157d0203bcfa96e2
For deploying Debian/bookworm (see MT#55524), we'd like to have an
updated Grml ISO. With such a Debian/bookworm based live system, we can
still deploy older target systems (like Debian/bullseye).
Relevant changes:
1) Add jo as a new build-dependency, to generate build information in
conf/buildinfo.json (a new dependency of grml-live)
2) Always include ca-certificates, as this is required with more recent
mmdebstrap versions (>=0.8.0), when using apt repositories with
https, otherwise bootstrapping Debian fails.
3) Update to latest stable grml-live version v0.42.0, which:
a) added support for "bookworm" as suite name
cff66073a7
b) provides corresponding templates for memtest support:
c01a86b3fc
c) and a workaround for a kmod/initramfs-tools issue with PXE/NFS boot:
ea1e5ea330
4) Update memtest86+ to v6.00-1 as present in Debian/bookworm and
add corresponding UEFI support (based on grml-live's upstream change,
though as we don't support i386, dropped the 32bit related bits)
Change-Id: I327c0e25c28f46e097212ef4329d75fc8d34767c
We build the pre-loaded library targeting a specific Debian release,
which might be different from (and newer than) the release Grml was
built for. This can cause missing versioned symbols (and a loading
failure) if the libc in the outer system is older than in the inner
system.
Change-Id: I84f4f307863e534fe0fff85274ae1d5db809012c
Git commit 6661b04af0 broke all our bullseye-based builds
(debian, sipwise + docker), see
https://jenkins.mgm.sipwise.com/view/All/job/daily-build-matrix-debian-boxes/
For plain Debian installations we don't have SP_VERSION available,
so default to what was used before supporting trunk-weekly next
to trunk.
Change-Id: I61958f0c67d165d2f6dcb059fe4991ed24a328c9