master
mr13.3
mr13.3.1
mr13.2
mr13.2.1
mr13.1
mr13.1.1
mr13.0
mr13.0.1
mr10.5
mr10.5.8
mr12.5.1
mr12.5
mr9.5
mr9.5.9
mr12.3.1
mr12.4.1
mr12.3
mr12.4
mr11.3
mr10.5.7
mr12.2
mr12.2.1
mr12.1
mr12.1.1
mr8.5
mr8.5.12
mr10.5.6
mr12.0
mr12.0.1
mr9.5.8
mr11.5
mr11.5.1
mr10.5.5
mr11.4.1
mr11.4
mr8.5.11
mr9.5.7
mr11.3.1
mr10.5.4
mr11.2
mr11.2.1
mr10.5.3
mr8.5.10
mr9.5.6
mr11.1
mr11.1.1
mr10.5.2
mr11.0
mr11.0.1
mr7.5
mr7.5.13
mr10.5.1
mr9.5.5
mr8.5.9
mr7.5.12
mr10.4
mr10.4.1
mr8.5.8
mr10.3
mr9.5.4
mr10.3.1
mr7.5.11
mr9.5.3
mr10.2
mr10.2.1
mr8.5.7
mr6.5
mr6.5.13
mr10.1
mr10.1.1
mr8.5.6
mr7.5.10
mr8.5.5
mr9.5.2
mr10.0.1
mr10.0
mr9.1.1
mr9.5.1
mr7.5.9
mr9.1
mr9.4
mr9.4.1
mr8.5.4
mr7.5.8
mr6.5.12
mr7.5.1
mr7.5.4
mr7.5.3
mr8.5.1
mr7.5.2
mr7.5.6
mr7.5.5
mr8.5.2
mr9.3
mr9.3.1
mr8.5.3
mr7.5.7
mr9.2
mr9.2.1
mr6.5.11
legacy_releases_before_mr6.2
mr9.0
mr9.0.1
mr6.5.10
mr8.4
mr8.4.2
mr8.3
mr8.3.2
mr8.4.1
mr6.5.9
mr8.2
mr8.2.2
mr8.3.1
mr6.5.8
mr8.1
mr8.1.2
mr6.5.7
mr6.5.6
mr7.4.1
mr7.4.2
mr6.2.1
mr6.2.2
mr6.3.1
mr6.3.2
mr6.4.1
mr6.4.2
mr6.5.1
mr6.5.2
mr6.5.3
mr6.5.4
mr6.5.5
mr7.0.1
mr7.0.2
mr7.1.1
mr7.1.2
mr7.2.1
mr7.2.2
mr7.3.1
mr7.3.2
mr8.2.1
mr8.0
mr8.0.2
mr8.1.1
mr8.0.1
mr7.4
mr7.3
mr7.2
mr7.1
mr7.0
mr6.4
mr6.3
mr6.2
mr10.0.1.1
mr10.0.1.2
mr10.1.1.1
mr10.2.1.1
mr10.3.1.1
mr10.4.1.1
mr10.5.1.1
mr10.5.2.1
mr10.5.3.1
mr10.5.4.1
mr10.5.5.1
mr10.5.6.1
mr10.5.7.1
mr10.5.8.1
mr11.0.1.1
mr11.1.1.1
mr11.2.1.1
mr11.3.1.1
mr11.4.1.1
mr11.4.1.2
mr11.5.1.1
mr12.0.1.1
mr12.1.1.1
mr12.2.1.1
mr12.3.1.1
mr12.3.1.2
mr12.4.1.1
mr12.4.1.2
mr12.5.1.1
mr12.5.1.2
mr13.0.1.1
mr13.1.1.1
mr13.2.1.1
mr13.3.1.1
mr6.2.1.1
mr6.2.1.2
mr6.2.2.1
mr6.2.2.2
mr6.3.1.1
mr6.3.1.2
mr6.3.2.1
mr6.3.2.2
mr6.4.1.1
mr6.4.1.2
mr6.4.1.3
mr6.4.2.1
mr6.4.2.2
mr6.5.1.1
mr6.5.1.2
mr6.5.10.1
mr6.5.11.1
mr6.5.12.1
mr6.5.13.1
mr6.5.2.1
mr6.5.2.2
mr6.5.3.1
mr6.5.3.2
mr6.5.4.1
mr6.5.4.2
mr6.5.5.1
mr6.5.5.2
mr6.5.6.1
mr6.5.6.2
mr6.5.6.3
mr6.5.6.4
mr6.5.7.1
mr6.5.7.2
mr6.5.8.1
mr6.5.9.1
mr7.0.1.1
mr7.0.1.2
mr7.0.2.1
mr7.0.2.2
mr7.1.1.1
mr7.1.1.2
mr7.1.2.1
mr7.1.2.2
mr7.2.1.1
mr7.2.1.2
mr7.2.2.1
mr7.2.2.2
mr7.3.1.1
mr7.3.1.2
mr7.3.1.3
mr7.3.2.1
mr7.3.2.2
mr7.4.1.1
mr7.4.1.2
mr7.4.2.1
mr7.4.2.2
mr7.5.1.1
mr7.5.1.2
mr7.5.1.3
mr7.5.10.1
mr7.5.10.2
mr7.5.11.1
mr7.5.12.1
mr7.5.13.1
mr7.5.2.1
mr7.5.2.2
mr7.5.3.1
mr7.5.3.2
mr7.5.4.1
mr7.5.4.2
mr7.5.5.1
mr7.5.5.2
mr7.5.6.1
mr7.5.6.2
mr7.5.7.1
mr7.5.7.2
mr7.5.8.1
mr7.5.9.1
mr8.0.1.1
mr8.0.1.2
mr8.0.2.1
mr8.1.1.1
mr8.1.2.1
mr8.2.1.1
mr8.2.2.1
mr8.3.1.1
mr8.3.2.1
mr8.4.1.1
mr8.4.2.1
mr8.5.1.1
mr8.5.1.2
mr8.5.1.3
mr8.5.10.1
mr8.5.11.1
mr8.5.12.1
mr8.5.2.1
mr8.5.2.2
mr8.5.3.1
mr8.5.3.2
mr8.5.4.1
mr8.5.5.1
mr8.5.5.2
mr8.5.6.1
mr8.5.7.1
mr8.5.8.1
mr8.5.9.1
mr9.0.1.1
mr9.1.1.1
mr9.1.1.2
mr9.2.1.1
mr9.3.1.1
mr9.4.1.1
mr9.5.1.1
mr9.5.2.1
mr9.5.3.1
mr9.5.4.1
mr9.5.5.1
mr9.5.6.1
mr9.5.7.1
mr9.5.8.1
mr9.5.9.1
334 Commits (master)
Author | SHA1 | Message | Date |
---|---|---|---|
|
c828990503 |
MT#62436 Use virtualbox-guest-additions ISO from upstream on Debian/trixie
virtualbox-guest-additions-iso v7.0.20-1 as present in current Debian/trixie doesn't yet support kernel v6.12.22-1 (the current kernel version in Debian/trixie), while upstream supports kernel 6.12 as of VirtualBox 7.1.4. Reported to Debian as https://bugs.debian.org/1104024 FTR:
| mprokop@jenkins1 ~ % cd /var/www/files
| mprokop@jenkins1 ~www/files % wget https://download.virtualbox.org/virtualbox/7.1.8/VBoxGuestAdditions_7.1.8.iso
| [...]
| mprokop@jenkins1 ~www/files % curl -s https://download.virtualbox.org/virtualbox/7.1.8/SHA256SUMS | sha256sum -c --ignore-missing
| VBoxGuestAdditions_7.1.8.iso: OK
Change-Id: I32aa7806e375c4b85084a99d5a6903f632807694 |
2 days ago |
|
112f883d49 |
MT#62436 ensure_packages_installed: to not get stuck on conf file conflicts
Our deployment ISO might be outdated, and when installing any additional packages we might get stuck in dpkg:
| +10:10:34 (netscript.grml:311): ensure_packages_installed(): DEBIAN_FRONTEND=noninteractive
| +10:10:34 (netscript.grml:311): ensure_packages_installed(): apt-get -o dir::cache=/tmp/ngcp-deployment-ensure-tmp.BKSocMV4KB/cachedir -o dir::state=/tmp/ngcp-deployment-ensure-tmp.BKSocMV4KB/statedir -o dir::etc=/tmp/ngcp-deployment-ensure-tmp.BKSocMV4KB/etc -o dir::etc::trustedparts=/etc/apt/trusted.gpg.d/ -y --no-install-recommends install jq
| Reading package lists...
| Building dependency tree...
| The following additional packages will be installed:
| [...]
| Get:33 https://debian.sipwise.com/debian trixie/main amd64 libnss-myhostname amd64 257.5-2 [113 kB]
| Preconfiguring packages ...
| Fetched 25.3 MB in 4s (6777 kB/s)
| (Reading database ... 32224 files and directories currently installed.)
| Preparing to unpack .../base-files_13.7_amd64.deb ...
| Unpacking base-files (13.7) over (12.4+deb12u10) ...
| Setting up base-files (13.7) ...
| Installing new version of config file /etc/debian_version ...
|
| Configuration file '/etc/issue'
| ==> Modified (by you or by a script) since installation.
| ==> Package distributor has shipped an updated version.
| What would you like to do about it ? Your options are:
| Y or I : install the package maintainer's version
| N or O : keep your currently-installed version
| D : show the differences between the versions
| Z : start a shell to examine the situation
| The default action is to keep your current version.
|
| *** issue (Y/I/N/O/D/Z) [default=N] ?
Avoid this by setting the dpkg option `--force-confnew`. Change-Id: Ic5fed3dbe4744e07290159cec6952468c0557c29 |
2 days ago |
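A minimal sketch of such a non-interactive install call; the function shape and package argument are illustrative, while the `DEBIAN_FRONTEND` setting and the dpkg `--force-confnew` option are taken from the commit above:

```sh
# Sketch: install packages without getting stuck on conffile prompts.
# --force-confnew makes dpkg install the package maintainer's version
# of changed configuration files instead of asking interactively.
ensure_packages_installed() {
  DEBIAN_FRONTEND=noninteractive apt-get \
    -o Dpkg::Options::="--force-confnew" \
    -y --no-install-recommends install "$@"
}

ensure_packages_installed jq
```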
|
779b43b915 |
MT#62436 Support Debian/trixie in ensure_packages_installed
vboxadd-service.service fails on our Debian/trixie systems:
| root@spce:~# lsb_release -c
| Codename: trixie
|
| root@spce:~# systemctl --failed
| UNIT LOAD ACTIVE SUB DESCRIPTION
| ● vboxadd-service.service loaded failed failed VirtualBox Guest Additions Services Daemon
|
| Legend: LOAD → Reflects whether the unit definition was properly loaded.
| ACTIVE → The high-level unit activation state, i.e. generalization of SUB.
| SUB → The low-level unit activation state, values depend on unit type.
|
| 1 loaded units listed.
|
| root@spce:~# sudo systemctl status vboxadd-service.service
| × vboxadd-service.service - VirtualBox Guest Additions Services Daemon
| Loaded: loaded (/etc/systemd/system/vboxadd-service.service; disabled; preset: disabled)
| Drop-In: /etc/systemd/system/vboxadd-service.service.d
| └─override.conf
| Active: failed (Result: exit-code) since Thu 2025-04-24 09:08:15 CEST; 34min ago
| Invocation: 4e151a29f0054a90a717a928fcfb3f8d
| Mem peak: 2.2M
| CPU: 17ms
|
| Apr 24 09:08:15 spce systemd[1]: Starting vboxadd-service.service...
| Apr 24 09:08:15 spce vboxadd-service[1934]: vboxadd-service.sh: Starting VirtualBox Guest Addition service.
| Apr 24 09:08:15 spce vboxadd-service.sh[1937]: Starting VirtualBox Guest Addition service.
| Apr 24 09:08:15 spce vboxadd-service[1940]: VBoxService: error: VbglR3Init failed with rc=VERR_FILE_NOT_FOUND
| Apr 24 09:08:15 spce vboxadd-service.sh[1943]: VirtualBox Guest Addition service started.
| Apr 24 09:08:15 spce systemd[1]: vboxadd-service.service: Control process exited, code=exited, status=1/FAILURE
| Apr 24 09:08:15 spce systemd[1]: vboxadd-service.service: Failed with result 'exit-code'.
| Apr 24 09:08:15 spce systemd[1]: Failed to start vboxadd-service.service.
|
| root@spce:~# cat /etc/systemd/system/vboxadd.service.d/override.conf
| [Unit]
| ConditionVirtualization=oracle
|
| root@spce:~# cat /var/log/vboxadd-setup.log
| Building the main Guest Additions 7.0.6 module for kernel 6.12.22-amd64.
| Error building the module. Build output follows.
| make V=1 CONFIG_MODULE_SIG= CONFIG_MODULE_SIG_ALL= -C /lib/modules/6.12.22-amd64/build M=/tmp/vbox.0 SRCROOT=/tmp/vbox.0 -j2 modules
| make[1]: warning: -j2 forced in submake: resetting jobserver mode.
| [...]
| [...] /tmp/vbox.0/VBoxGuest-common.c
| /tmp/vbox.0/VBoxGuest-linux.c:196:21: error: ‘no_llseek’ undeclared here (not in a function); did you mean ‘noop_llseek’?
| 196 | llseek: no_llseek,
| | ^~~~~~~~~
| | noop_llseek
| /tmp/vbox.0/VBoxGuest-linux.c: In function ‘vgdrvLinuxParamLogGrpSet’:
| /tmp/vbox.0/VBoxGuest-linux.c:1364:9: error: implicit declaration of function ‘strlcpy’; did you mean ‘strncpy’? [-Wimplicit-function-declaration]
| 1364 | strlcpy(&g_szLogGrp[0], pszValue, sizeof(g_szLogGrp));
| | ^~~~~~~
| | strncpy
| make[2]: *** [/usr/src/linux-headers-6.12.22-common/scripts/Makefile.build:234: /tmp/vbox.0/VBoxGuest-linux.o] Error 1
| make[2]: *** Waiting for unfinished jobs....
| [...]
We get virtualbox-guest-additions-iso v7.0.6-1 for Debian stable/bookworm, but virtualbox-guest-additions-iso v7.0.20-1 is available in current Debian testing AKA trixie. Ensure we use the package from trixie for trixie-based systems, even though the VirtualBox Guest Additions v7.0.20 don't work for kernel 6.12.22 either, yet. Also adjust ensure_packages_installed to fail the installation if we're running on a yet unknown/unexpected Debian release, instead of falling back to Debian/bookworm, to prevent issues like the one observed here.
See MT#60815 for the main tracking issue WRT Debian/trixie. Change-Id: I030525d37edbe1cf75065d021b51d38273ce81ef |
2 days ago |
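A sketch of the stricter release handling described above, assuming the `DEBIAN_RELEASE` variable named in a later commit and the `die` helper used elsewhere in the deployment script:

```sh
# Sketch: only accept Debian releases we know about; anything else
# aborts instead of silently falling back to bookworm repositories.
case "${DEBIAN_RELEASE}" in
  bookworm|trixie)
    echo "Using Debian/${DEBIAN_RELEASE} repositories for package installation"
    ;;
  *)
    die "Error: unexpected Debian release '${DEBIAN_RELEASE}', refusing to fall back to bookworm."
    ;;
esac
```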
|
b2e2954852 |
MT#62436 Fix shellcheck issues + parse IP information programmatically
As reported when sending new deployment-iso reviews, triggered by a newer docker image / shellcheck:
| not ok 1 source/templates/scripts/includes/deployment.sh:1543:10: warning: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a. [SC2206]
| not ok 2 source/templates/scripts/includes/deployment.sh:1903:22: warning: Prefer mapfile or read -a to split command output (or quote to avoid splitting). [SC2207]
| not ok 3 source/templates/scripts/includes/deployment.sh:2275:20: warning: Prefer mapfile or read -a to split command output (or quote to avoid splitting). [SC2207]
| not ok 4 source/templates/scripts/includes/deployment.sh:2486:12: note: Not following: ./etc/profile.d/puppet-agent.sh was not specified as input (see shellcheck -x). [SC1091]
Let's take this as a chance to properly parse ip(8) output via its JSON mode, instead of relying on awk/sed magic. Change-Id: I723959626fb514ab9e57202b0e5f415b411f5a01 |
2 days ago |
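For illustration, the kind of JSON-based parsing meant here; the device name and the exact jq filter are assumptions, while `ip -json` and jq are standard tools:

```sh
# Sketch: read the first IPv4 address of a device from ip(8)'s JSON
# output instead of scraping the human-readable text with awk/sed.
dev=eth0
ipaddr=$(ip -json addr show dev "${dev}" |
  jq -r '.[0].addr_info[] | select(.family == "inet") | .local' |
  head -1)
echo "IPv4 address of ${dev}: ${ipaddr}"
```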
|
dfd46069e7 |
MT#62436 Remove workaround for vboxadd services
We have made these services conditional on running inside a VirtualBox VM, so we do not need to remove them anymore. Change-Id: I6dc563688ba5b0c5e935b0cb88767fcb05ab9a19 |
3 weeks ago |
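The condition in question is visible in the systemctl status output quoted further up (`ConditionVirtualization=oracle`); a sketch of how such a drop-in can be installed, with the exact unit list as an assumption:

```sh
# Sketch: make the vboxadd units start only inside VirtualBox guests,
# so they no longer need to be removed on other systems.
for unit in vboxadd.service vboxadd-service.service ; do
  mkdir -p "/etc/systemd/system/${unit}.d"
  cat > "/etc/systemd/system/${unit}.d/override.conf" << 'EOF'
[Unit]
ConditionVirtualization=oracle
EOF
done
systemctl daemon-reload
```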
|
41029ed891 |
MT#61264 Mark EFI partition as such only when running in an EFI environment
On Debian/trixie we get a failing efi.mount systemd unit:
| root@sp1:~# systemctl --failed
| UNIT LOAD ACTIVE SUB DESCRIPTION
| ● efi.mount loaded failed failed EFI System Partition Automount
|
| Legend: LOAD → Reflects whether the unit definition was properly loaded.
| ACTIVE → The high-level unit activation state, i.e. generalization of SUB.
| SUB → The low-level unit activation state, values depend on unit type.
|
| 1 loaded units listed.
|
| root@sp1:~# systemctl status efi.mount
| × efi.mount - EFI System Partition Automount
| Loaded: loaded (/run/systemd/generator.late/efi.mount; generated)
| Active: failed (Result: exit-code) since Fri 2024-11-15 17:20:59 CET; 28min ago
| Invocation: 62c7b659dfd540e294f4b1f6fcda5e13
| TriggeredBy: ● efi.automount
| Where: /efi
| What: /dev/disk/by-diskseq/9-part2
| Docs: man:systemd-gpt-auto-generator(8)
| Mem peak: 1.5M
| CPU: 8ms
|
| Nov 15 17:20:59 sp1 systemd[1]: Mounting efi.mount - EFI System Partition Automount...
| Nov 15 17:20:59 sp1 mount[631]: mount: /efi: wrong fs type, bad option, bad superblock on /dev/sda2, missing codepage or helper program, or other error.
| Nov 15 17:20:59 sp1 mount[631]: dmesg(1) may have more information after failed mount system call.
| Nov 15 17:20:59 sp1 systemd[1]: efi.mount: Mount process exited, code=exited, status=32/n/a
| Nov 15 17:20:59 sp1 systemd[1]: efi.mount: Failed with result 'exit-code'.
| Nov 15 17:20:59 sp1 systemd[1]: Failed to mount efi.mount - EFI System Partition Automount.
|
| root@sp1:~# ls -la /efi
| ls: cannot open directory '/efi': No such device
|
| root@sp1:~# ls -la /dev/disk/by-diskseq/9-part2
| lrwxrwxrwx 1 root root 10 Nov 15 17:20 /dev/disk/by-diskseq/9-part2 -> ../../sda2
|
| root@sp1:~# blkid /dev/sda2
| /dev/sda2: PARTLABEL="EFI System" PARTUUID="fa67b52e-c018-401d-ac71-fad324cad193"
The efi.mount systemd unit is automatically generated by
systemd-gpt-auto-generator. Quoting from systemd-gpt-auto-generator(8):
| The ESP is mounted to /boot/ if that directory exists and is not used
| for XBOOTLDR, and otherwise to /efi/
This behavior was introduced with systemd v254; see
|
5 months ago |
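A sketch of the corresponding check; testing for /sys/firmware/efi is the usual way to detect an EFI boot, while the sgdisk call and the partition number are illustrative assumptions:

```sh
# Sketch: only flag partition 2 as EFI System Partition when the live
# system itself booted via EFI; on BIOS systems the partition type is
# left alone, so systemd-gpt-auto-generator won't generate a failing
# efi.mount unit for it.
if [ -d /sys/firmware/efi ] ; then
  sgdisk --typecode=2:ef00 /dev/sda  # hypothetical disk layout
fi
```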
|
cfe9cceb6a |
MT#61271 trixie: adjust sshd_config after system is installed
If we set up /etc/ssh/sshd_config early in system deployment, we end up with an almost empty /etc/ssh/sshd_config configuration file containing only our own changes:
| root@spce:~# cat /etc/ssh/sshd_config
| # added by deployment.sh
| PerSourcePenalties no
| # end of deployment.sh changes
| ### Added by ngcp-installer
| PermitRootLogin yes
The other sshd defaults are OK for us, but for automated SSH logins we also need:
AuthorizedKeysFile %h/.ssh/authorized_keys %h/.ssh/sipwise_vagrant_key
And for SCP-ing files we also need:
Subsystem sftp /usr/lib/openssh/sftp-server
Otherwise our Jenkins jobs fail due to failing ssh/scp actions. So instead move our trixie-specific code in deployment.sh for adjusting /etc/ssh/sshd_config to be executed *after* installing the base system. Then the openssh-server package sets up /etc/ssh/sshd_config as expected, and we only extend its configuration afterwards. While at it, explicitly mark the beginning and end of our changes. Change-Id: I68a235b55e9cf18c39e9034b7f3b2ed0ffd237f0 |
6 months ago |
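A sketch of that post-install extension; the appended setting is quoted from the commit message, /mnt as the mount point of the installed system is an assumption, and the package-provided defaults (e.g. the sftp subsystem) are left untouched:

```sh
# Sketch: extend, not replace, the sshd_config that openssh-server
# shipped, once the base system is installed under /mnt.
cat >> /mnt/etc/ssh/sshd_config << 'EOF'
# added by deployment.sh
AuthorizedKeysFile %h/.ssh/authorized_keys %h/.ssh/sipwise_vagrant_key
# end of deployment.sh changes
EOF
```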
|
6eee97de7b |
MT#61265 trixie: avoid SSH login failures due to OpenSSH penalize feature
Our https://jenkins.mgm.sipwise.com/job/daily-build-matrix-debian-boxes/ matrix no longer provides builds for debian/trixie, because the proxmox-vm-clean-fs job of its daily-build-images subproject failed to run. After running proxmox-vm-clean-fs under `set -x`, and also overriding the ssh_wrapper function with `ssh -v ...`, I managed to grab this from the Jenkins job execution:
| + ssh -v -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o 'ServerAliveInterval 10' -o 'ConnectTimeout 15' 192.168.210.101 'rm -vf /etc/udev/rules.d/70-persistent-net.rules'
| OpenSSH_9.2p1 Debian-2+deb12u3, OpenSSL 3.0.14 4 Jun 2024
| debug1: Reading configuration data /var/lib/jenkins/.ssh/config
| debug1: /var/lib/jenkins/.ssh/config line 7: Applying options for 192.168.*
| debug1: Reading configuration data /etc/ssh/ssh_config
| debug1: /etc/ssh/ssh_config line 50: Applying options for *
| debug1: /etc/ssh/ssh_config line 57: Deprecated option "useroaming"
| debug1: Connecting to 192.168.210.101 [192.168.210.101] port 22.
| debug1: fd 3 clearing O_NONBLOCK
| debug1: Connection established.
| debug1: identity file /var/lib/jenkins/.ssh/id_rsa_sipwise type 0
| debug1: identity file /var/lib/jenkins/.ssh/id_rsa_sipwise-cert type -1
| debug1: identity file /var/lib/jenkins/.ssh/id_rsa type 0
| debug1: identity file /var/lib/jenkins/.ssh/id_rsa-cert type -1
| debug1: identity file /var/lib/jenkins/.ssh/id_dsa type -1
| debug1: identity file /var/lib/jenkins/.ssh/id_dsa-cert type -1
| debug1: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u3
| debug1: kex_exchange_identification: banner line 0: Not allowed at this time
The `Not allowed at this time` pointed to a new OpenSSH feature, which triggered the regression for us. OpenSSH introduced options to penalize undesirable behavior, see https://undeadly.org/cgi?action=article;sid=20240607042157 and https://www.openssh.com/releasenotes.html#9.9p1 and https://sources.debian.org/src/openssh/1:9.9p1-1/sshd.c/?hl=576#L573 This is now present as of openssh-server v1:9.9p1-1, since end of September 2024 also in Debian/trixie. Now, when too many SSH logins fail, a client system may no longer be able to connect via SSH due to this new penalty behavior.
And indeed, within our Jenkins job "daily-build-install-vm" we try to collect several log files through our grab_log and SSH wrapper:
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /mnt/tmp/ngcp-installer-debug.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/ngcp-installer-debug.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/ngcp-installer-debug.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/ngcp-installer-debug.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /mnt/tmp/ngcp-installer.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/ngcp-installer.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/ngcp-installer.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/ngcp-installer.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /tmp/ngcp-installer-cmdline.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/ngcp-installer-cmdline.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /mnt/var/log/deployment.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/deployment.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/deployment.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/deployment.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /mnt/var/log/grml-debootstrap.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/grml-debootstrap.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/grml-debootstrap.log /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/grml-debootstrap.log
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/syslog /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/syslog
| + timeout 20 sipwise-ssh-copier 192.168.210.101 root sipwise /var/log/boot /buildtmpfs/tmp_jenkins-vm-builder/vmbuilder101/192.168.210.101/boot
We even execute this grab_log wrapper twice: once for the running Grml live system, and once when we booted into the actually deployed system. This works fine for the Grml live system, but as root logins haven't been allowed by default in OpenSSH for quite some time, all the sipwise-ssh-copier runs with user/password against a plain Debian system then fail. As a consequence, we lock ourselves out of the system with all those SSH login failures, and the Jenkins job proxmox-vm-clean-fs then runs into the OpenSSH penalty, which causes the trixie/debian job to fail. We use our Debian images as base for further configuration, where we control the sshd_config file through our ngcpcfg system anyway, so the `PerSourcePenalties no` setting is supposed to disappear again later. FTR: We could also enable `PermitRootLogin yes` in sshd_config to get grab_log working, though this hasn't had any relevance for us so far. Disabling only the `PerSourcePenalties` feature feels like the better trade-off, at least security-wise, for now. Change-Id: Ibf16019b4787cc63d450501c8bccebeac77dd9f1 |
6 months ago |
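A sketch of disabling just this one feature and validating the result; `sshd -t` as a syntax check is standard, the reload step is illustrative:

```sh
# Sketch: opt out of OpenSSH's per-source penalty feature so repeated
# failed probe logins can't lock Jenkins out of the box.
echo 'PerSourcePenalties no' >> /etc/ssh/sshd_config
/usr/sbin/sshd -t              # validate the configuration first
systemctl reload ssh.service
```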
|
862c84ccc6 |
MT#60698 Add mr12.5 LTS key to bootstrap
Now it contains:
pub   rsa4096 2015-03-05 [SC] [expires: 2029-10-12]
      68A702B1FD8E422AAAA1ADA3773236EFF411A836
uid           [ unknown] Sipwise GmbH (Sipwise Repository Key) <support@sipwise.com>
sub   rsa4096 2015-03-05 [E] [expires: 2029-10-12]
pub   rsa4096 2011-06-06 [SC]
      F7B8A739CE638D719A078C9859104633EE5E097D
uid           [ unknown] Sipwise autobuilder (Used to sign packages for autobuild) <development@sipwise.com>
sub   rsa4096 2011-06-06 [E]
pub   rsa4096 2022-05-31 [SCEA] [expires: 2032-05-28]
      39EB73D5B54870181632E48786C3B4395CB844A2
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
pub   rsa4096 2023-08-04 [SCEA] [expires: 2033-08-01]
      F0A595D85C375447BB09F25E34A72CE4979CA98A
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
pub   rsa4096 2024-08-14 [SCEA] [expires: 2034-08-12]
      A164D3A12AC0F6AB8F737EF66D1B7D01D2AD9C24
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
Change-Id: I142de8611572fd35fa6bbac3695b236a1b3f9a97 |
8 months ago |
|
cf94193f88 |
MT#60284 Ensure to start qemu-guest-agent only after package got installed
We install the qemu-guest-agent package in ensure_packages_installed().
Therefore, try to start the qemu-guest-agent service only afterwards.
Fixup for commit
|
11 months ago |
|
4a292ab4be |
MT#60284 Only check whether /dev/virtio-ports/org.qemu.guest_agent.0 exists
/dev/virtio-ports/org.qemu.guest_agent.0 usually is a symlink to the
character device /dev/vport1p1. So adjust the device check accordingly
and only verify it exists, but don't expect any special file type.
This actually matches the behavior we also have in ngcp-installer.
Fixup for commit
|
11 months ago |
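Taken together with the follow-up fixes above, the resulting flow might look like this sketch; `ensure_packages_installed` is the function named in these commits, and the plain `-e` test reflects that the path is usually a symlink to the real character device:

```sh
# Sketch: bring up the QEMU guest agent only where the virtio port
# exists, and only start the service after the package is installed.
if [ -e /dev/virtio-ports/org.qemu.guest_agent.0 ] ; then
  ensure_packages_installed qemu-guest-agent
  systemctl start qemu-guest-agent.service
fi
```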
|
82e6638b40 |
MT#60284 Make sure qemu-guest-agent is available
Now that we enabled the QEMU Guest Agent option for our PVE VMs, we need
to have qemu-guest-agent present and active. Otherwise the VMs might
fail to shut down, like with our debian/sipwise/docker Debian systems
which are created via
https://jenkins.mgm.sipwise.com/job/daily-build-matrix-debian-boxes/:
| [proxmox-vm-shutdown] $ /bin/sh -e /tmp/env-proxmox-vm-shutdown7956268380939677154.sh
| [environment-script] Adding variable 'vm1reset' with value 'NO'
| [environment-script] Adding variable 'vm2' with value 'none'
| [environment-script] Adding variable 'vm1' with value 'none'
| [environment-script] Adding variable 'vm2reset' with value 'NO'
| [proxmox-vm-shutdown] $ /bin/bash /tmp/jenkins14192704603218787414.sh
| Using safe VM 'shutdown' for modern releases (mr6.5+). Executing action 'shutdown'...
| Shutting down VM 106
| Build timed out (after 10 minutes). Marking the build as aborted.
| Build was aborted
| [WS-CLEANUP] Deleting project workspace...
Let's make sure qemu-guest-agent is available in our Grml live system.
We added qemu-guest-agent to the package list of our Grml Sipwise ISO
(see git rev
|
11 months ago |
|
6cf4786735 |
MT#59872 Remove NGCP_PXE_INSTALL variable
With this variable we had some tricks in ngcp-initial-configuration for when the Pro sp2 node is installed via an iPXE/cm image. Now we support installation of sp2 via iPXE only, so there is no need to pass this variable. But we need to keep the parent ngcppxeinstall parameter, as we need this information for netcardconfig. Change-Id: I20491289917cbb427ad6f5670f108c632838be71 |
1 year ago |
|
0a91a49826 |
MT#58014 Remove support for fetching OpenPGP certificates from keyservers
The code trying to fetch the OpenPGP certificate from a keyserver has
been non-functional for a while as the GPG_KEY_SERVER variable was
removed in commit
|
1 year ago |
|
e99f33e11a |
TT#118659 Do not fail when deploying SW-RAID if no RAID was present yet
Followup fix for commit
|
1 year ago |
|
1d59d89d04 |
TT#118659 Do not abort on disk partition listing failures
We identify any existing partitions of the disk we need to wipe via:
| root@license42 ~ # lsblk --noheadings --output KNAME /dev/sda
| sda
| sda1
| sda2
| sda3
| root@license42 ~ # blockdevice="/dev/sda"
| root@license42 ~ # lsblk --noheadings --output KNAME /dev/sda | grep -v "^${blockdevice#\/dev\/}$"
| sda1
| sda2
| sda3
This might fail though, if there are no partitions present:
| root@license42 ~ # dd if=/dev/zero of=/dev/sda bs=10M count=1
| 1+0 records in
| 1+0 records out
| 10485760 bytes (10 MB, 10 MiB) copied, 0.0487036 s, 215 MB/s
| root@license42 ~ # pvremove /dev/sda --force --force --yes
| Labels on physical volume "/dev/sda" successfully wiped.
| root@license42 ~ # blockdevice="/dev/sda"
| root@license42 ~ # lsblk --noheadings --output KNAME /dev/sda | grep -v "^${blockdevice#\/dev\/}$"
| 1 root@license42 ~ #
This ends up in our daily-build-install-vm Jenkins jobs like this:
| +13:08:19 (netscript.grml:489): clear_partition_table(): echo 'Removing possibly existing LVM/PV label from /dev/sda'
| +13:08:19 (netscript.grml:490): clear_partition_table(): pvremove /dev/sda --force --force --yes
| Labels on physical volume "/dev/sda" successfully wiped.
| ++13:08:19 (netscript.grml:495): clear_partition_table(): grep -v '^sda$'
| ++13:08:19 (netscript.grml:495): clear_partition_table(): lsblk --noheadings --output KNAME /dev/sda
| +++13:08:19 (netscript.grml:495): clear_partition_table(): wait_exit
| +++13:08:19 (netscript.grml:339): wait_exit(): local e_code=1
| +++13:08:19 (netscript.grml:340): wait_exit(): [[ 1 -ne 0 ]]
| +++13:08:19 (netscript.grml:341): wait_exit(): set_deploy_status error
| +++13:08:19 (netscript.grml:103): set_deploy_status(): '[' -n error ']'
| +++13:08:19 (netscript.grml:104): set_deploy_status(): echo error
| Wiping disk signatures from /dev/sda
| +++13:08:19 (netscript.grml:343): wait_exit(): trap '' 1 2 3 6 15 ERR EXIT
| +++13:08:19 (netscript.grml:344): wait_exit(): status_wait
| +++13:08:19 (netscript.grml:329): status_wait(): [[ -n 0 ]]
| +++13:08:19 (netscript.grml:329): status_wait(): [[ 0 != 0 ]]
Followup change for
|
1 year ago |
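One way to make that listing non-fatal, sketched with the mapfile approach that shellcheck's SC2207 hint (quoted further up) recommends; the `|| true` keeps grep's exit code 1 on empty input from triggering the ERR trap:

```sh
# Sketch: collect the partitions of ${blockdevice}; grep exits
# non-zero when the disk has no partitions at all, which must not
# abort the deployment.
blockdevice="/dev/sda"
mapfile -t partitions < <(
  lsblk --noheadings --output KNAME "${blockdevice}" \
    | grep -v "^${blockdevice#/dev/}$" || true
)
echo "Found ${#partitions[@]} partition(s) on ${blockdevice}"
```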
|
fc9b43f92e |
TT#118659 Fix re-deploying over existing SW-RAID arrays
Fresh deployments with SW-RAID (Software-RAID) might fail if the present disks were already part of an SW-RAID setup:
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
We could also reproduce this inside PVE VMs:
| mdadm: /dev/md/127 has been started with 2 drives.
| Error: disk sda seems to be part of an existing SW-RAID setup.
This is caused by the following behavior:
| + SWRAID_DEVICE="/dev/md0"
| [...]
| + mdadm --assemble --scan
| + true
| + [[ -b /dev/md0 ]]
| + for disk in "${SWRAID_DISK1}" "${SWRAID_DISK2}"
| + grep -q nvme1n1 /proc/mdstat
| + die 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| + echo 'Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.'
| Error: disk nvme1n1 seems to be part of an existing SW-RAID setup.
By default we expect and set SWRAID_DEVICE to be /dev/md0. But only "local" arrays get assembled as /dev/md0 and upwards, whereas "foreign" arrays start at md127 downwards. This is exactly what we get when booting our deployment live system on top of an existing installation and assembling existing SW-RAIDs (to not overwrite unexpected disks by mistake):
| root@grml ~ # lsblk
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
| loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs
| /run/live/rootfs/ngcp.squashfs
| nvme0n1 259:0 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
| nvme1n1 259:7 0 447.1G 0 disk
| └─md127 9:127 0 447.1G 0 raid1
| ├─md127p1 259:14 0 18G 0 part
| ├─md127p2 259:15 0 18G 0 part
| ├─md127p3 259:16 0 405.6G 0 part
| ├─md127p4 259:17 0 512M 0 part
| ├─md127p5 259:18 0 4G 0 part
| └─md127p6 259:19 0 1G 0 part
|
| root@grml ~ # lsblk -l -n -o TYPE,NAME
| loop loop0
| raid1 md127
| disk nvme0n1
| disk nvme1n1
| part md127p1
| part md127p2
| part md127p3
| part md127p4
| part md127p5
| part md127p6
|
| root@grml ~ # cat /proc/cmdline
| vmlinuz initrd=initrd.img swraiddestroy swraiddisk2=nvme0n1 swraiddisk1=nvme1n1 [...]
Let's identify existing RAID devices and check their configuration by going through the disks and comparing them with our SWRAID_DISK1 and SWRAID_DISK2. If they don't match, we stop execution to prevent any possible data damage. Furthermore, we need to assemble the mdadm array without relying on a possibly existing local `/etc/mdadm/mdadm.conf` configuration file.
Otherwise assembling might fail: | root@grml ~ # cat /proc/mdstat | Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] | unused devices: <none> | root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}' | root@grml ~ # grep ARRAY /etc/mdadm/mdadm.conf | ARRAY /dev/md/127 metadata=1.0 UUID=0d44774e:7269bac6:2f02f337:4551597b name=localhost:127 | root@grml ~ # mdadm --assemble --scan | 2 root@grml ~ # mdadm --assemble --scan --verbose | mdadm: looking for devices for /dev/md/127 | mdadm: No super block found on /dev/loop0 (Expected magic a92b4efc, got 800989c0) | mdadm: no RAID superblock on /dev/loop0 | mdadm: No super block found on /dev/nvme1n1p3 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme1n1p3 | mdadm: No super block found on /dev/nvme1n1p2 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme1n1p2 | mdadm: No super block found on /dev/nvme1n1p1 (Expected magic a92b4efc, got 000080fe) | mdadm: no RAID superblock on /dev/nvme1n1p1 | mdadm: No super block found on /dev/nvme1n1 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme1n1 | mdadm: No super block found on /dev/nvme0n1p3 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme0n1p3 | mdadm: No super block found on /dev/nvme0n1p2 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme0n1p2 | mdadm: No super block found on /dev/nvme0n1p1 (Expected magic a92b4efc, got 000080fe) | mdadm: no RAID superblock on /dev/nvme0n1p1 | mdadm: No super block found on /dev/nvme0n1 (Expected magic a92b4efc, got 00000000) | mdadm: no RAID superblock on /dev/nvme0n1 | 2 root@grml ~ # mdadm --assemble --scan --config /dev/null | mdadm: /dev/md/grml:127 has been started with 2 drives. | root@grml ~ # lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2}' | md127 By running mdadm assemble with `--config /dev/null`, we prevent consideration and usage of a possibly existing /etc/mdadm/mdadm.conf configuration file. Example output of running the new code: | [...] | mdadm: No arrays found in config file or automatically | NOTE: default SWRAID_DEVICE set to /dev/md0 though we identified active md127 | NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup | Wiping signatures from /dev/md127 | /dev/md127: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31 | Removing mdadm device /dev/md127 | Stopping mdadm device /dev/md127 | mdadm: stopped /dev/md127 | Zero-ing superblock from /dev/nvme1n1 | mdadm: Unrecognised md component device - /dev/nvme1n1 | Zero-ing superblock from /dev/nvme0n1 | mdadm: Unrecognised md component device - /dev/nvme0n1 | NOTE: modified RAID array detected, setting SWRAID_DEVICE back to original setting '/dev/md0' | Removing possibly existing LVM/PV label from /dev/nvme1n1 | Cannot use /dev/nvme1n1: device is partitioned | Removing possibly existing LVM/PV label from /dev/nvme1n1p1 | Cannot use /dev/nvme1n1p1: device is too small (pv_min_size) | Removing possibly existing LVM/PV label from /dev/nvme1n1p2 | Labels on physical volume "/dev/nvme1n1p2" successfully wiped. 
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3 | Cannot use /dev/nvme1n1p3: device is an md component | Wiping disk signatures from /dev/nvme1n1 | /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54 | /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54 | /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa | /dev/nvme1n1: calling ioctl to re-read partition table: Success | 1+0 records in | 1+0 records out | 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0027866 s, 376 MB/s | Removing possibly existing LVM/PV label from /dev/nvme0n1 | Cannot use /dev/nvme0n1: device is partitioned | Removing possibly existing LVM/PV label from /dev/nvme0n1p1 | Cannot use /dev/nvme0n1p1: device is too small (pv_min_size) | Removing possibly existing LVM/PV label from /dev/nvme0n1p2 | Labels on physical volume "/dev/nvme0n1p2" successfully wiped. | Removing possibly existing LVM/PV label from /dev/nvme0n1p3 | Cannot use /dev/nvme0n1p3: device is an md component | Wiping disk signatures from /dev/nvme0n1 | /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54 | /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54 | /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa | /dev/nvme0n1: calling ioctl to re-read partition table: Success | 1+0 records in | 1+0 records out | 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00278955 s, 376 MB/s | Creating partition table | Get path of EFI partition | pvdevice is now available: /dev/nvme1n1p2 | The operation has completed successfully. | The operation has completed successfully. | pvdevice is now available: /dev/nvme1n1p3 | pvdevice is now available: /dev/nvme0n1p3 | mdadm: /dev/nvme1n1p3 appears to be part of a raid array: | level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024 | mdadm: Note: this array has metadata at the start and | may not be suitable as a boot device. If you plan to | store '/boot' on this device please ensure that | your boot-loader understands md/v1.x metadata, or use | --metadata=0.90 | mdadm: /dev/nvme0n1p3 appears to be part of a raid array: | level=raid1 devices=2 ctime=Wed Jan 24 10:31:43 2024 | mdadm: size set to 468218880K | mdadm: automatically enabling write-intent bitmap on large array | Continue creating array? mdadm: Defaulting to version 1.2 metadata | mdadm: array /dev/md0 started. | Creating PV + VG on /dev/md0 | Physical volume "/dev/md0" successfully created. | Volume group "ngcp" successfully created | 0 logical volume(s) in volume group "ngcp" now active | Creating LV 'root' with 10G | [...] | | mdadm: stopped /dev/md127 | mdadm: No arrays found in config file or automatically | NOTE: will continue with '/dev/md127' as SWRAID_DEVICE for mdadm cleanup | Removing mdadm device /dev/md127 | Stopping mdadm device /dev/md127 | mdadm: stopped /dev/md127 | mdadm: Unrecognised md component device - /dev/nvme1n1 | mdadm: Unrecognised md component device - /dev/nvme0n1 | mdadm: /dev/nvme1n1p3 appears to be part of a raid array: | mdadm: Note: this array has metadata at the start and | mdadm: /dev/nvme0n1p3 appears to be part of a raid array: | mdadm: size set to 468218880K | mdadm: automatically enabling write-intent bitmap on large array | Continue creating array? mdadm: Defaulting to version 1.2 metadata | mdadm: array /dev/md0 started. 
| lvm2 mdadm wget | Get:1 http://http-proxy.lab.sipwise.com/debian bookworm/main amd64 mdadm amd64 4.2-5 [443 kB] | Selecting previously unselected package mdadm. | Preparing to unpack .../0-mdadm_4.2-5_amd64.deb ... | Unpacking mdadm (4.2-5) ... | Setting up mdadm (4.2-5) ... | [...] | mdadm: stopped /dev/md0 Change-Id: Ib5875248e9c01dd4251bfab2cc4c94daace503fa |
1 year ago |
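A condensed sketch of that assemble-and-verify logic; the lsblk/awk detection and `--config /dev/null` are quoted from the commit, the member check against /proc/mdstat and the variable names follow the traces above, the rest is illustrative:

```sh
# Sketch: assemble any existing ("foreign") array while ignoring a
# stale /etc/mdadm/mdadm.conf, then refuse to continue unless its
# members are exactly the disks we are allowed to wipe.
mdadm --assemble --scan --config /dev/null || true
raiddev=$(lsblk -l -n -o TYPE,NAME | awk '/^raid/ {print $2; exit}')
if [ -n "${raiddev}" ] ; then
  for disk in "${SWRAID_DISK1}" "${SWRAID_DISK2}" ; do
    grep -q "${disk}" /proc/mdstat \
      || die "Error: active /dev/${raiddev} doesn't include disk ${disk}."
  done
  SWRAID_DEVICE="/dev/${raiddev}"  # e.g. /dev/md127 instead of /dev/md0
fi
```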
|
e9244a289b |
TT#118659 Wipe disk signatures more reliably with SW-RAID and NVMe setup
Deploying current NGCP trunk on an NVMe-powered SW-RAID setup failed with:
| mdadm: size set to 468218880K
| mdadm: automatically enabling write-intent bitmap on large array
| Continue creating array? mdadm: Defaulting to version 1.2 metadata
| mdadm: array /dev/md0 started.
| Creating PV + VG on /dev/md0
| Cannot use /dev/md0: device is partitioned
This is caused by /dev/md0 still containing partition data, while its nvme1n1p3 also still carries a linux_raid_member disk signature. So it's *not* enough to stop the mdadm array, remove PV/LVM information from the partitions and finally wipe the SW-RAID disks /dev/nvme1n1 + /dev/nvme0n1 (example output from such a failing run):
| mdadm: /dev/md/0 has been started with 2 drives.
| mdadm: stopped /dev/md0
| mdadm: Unrecognised md component device - /dev/nvme1n1
| mdadm: Unrecognised md component device - /dev/nvme0n1
| Removing possibly existing LVM/PV label from /dev/nvme1n1
| Cannot use /dev/nvme1n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme1n1p1
| Cannot use /dev/nvme1n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme1n1p2
| Labels on physical volume "/dev/nvme1n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme1n1p3
| Cannot use /dev/nvme1n1p3: device is an md component
| Wiping disk signatures from /dev/nvme1n1
| /dev/nvme1n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme1n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme1n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00314195 s, 334 MB/s
| Removing possibly existing LVM/PV label from /dev/nvme0n1
| Cannot use /dev/nvme0n1: device is partitioned
| Removing possibly existing LVM/PV label from /dev/nvme0n1p1
| Cannot use /dev/nvme0n1p1: device is too small (pv_min_size)
| Removing possibly existing LVM/PV label from /dev/nvme0n1p2
| Labels on physical volume "/dev/nvme0n1p2" successfully wiped.
| Removing possibly existing LVM/PV label from /dev/nvme0n1p3
| Cannot use /dev/nvme0n1p3: device is an md component
| Wiping disk signatures from /dev/nvme0n1
| /dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 52 54
| /dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
| /dev/nvme0n1: calling ioctl to re-read partition table: Success
| 1+0 records in
| 1+0 records out
| 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00893285 s, 117 MB/s
| Creating partition table
| Get path of EFI partition
| pvdevice is now available: /dev/nvme1n1p2
| The operation has completed successfully.
| The operation has completed successfully.
| pvdevice is now available: /dev/nvme1n1p3
| pvdevice is now available: /dev/nvme0n1p3
| mdadm: /dev/nvme1n1p3 appears to be part of a raid array:
| level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023
| mdadm: Note: this array has metadata at the start and
| may not be suitable as a boot device.
If you plan to | store '/boot' on this device please ensure that | your boot-loader understands md/v1.x metadata, or use | --metadata=0.90 | mdadm: /dev/nvme0n1p3 appears to be part of a raid array: | level=raid1 devices=2 ctime=Wed Dec 20 20:35:21 2023 | mdadm: size set to 468218880K | mdadm: automatically enabling write-intent bitmap on large array | Continue creating array? mdadm: Defaulting to version 1.2 metadata | mdadm: array /dev/md0 started. | Creating PV + VG on /dev/md0 | Cannot use /dev/md0: device is partitioned Instead we also need to wipe signatures from the SW-RAID device (like /dev/md0), only then stop it, ensure we wipe disk signatures also from all the partitions (like /dev/nvme1n1p3) and only then finally remove the disk signatures from the main block device (like /dev/nvme1n1). Example from a successful run with this change: | root@grml ~ # grep -e mdadm -e Wiping /tmp/deployment-installer-debug.log | mdadm: /dev/md/0 has been started with 2 drives. | Wiping signatures from /dev/md0 | Removing mdadm device /dev/md0 | Stopping mdadm device /dev/md0 | mdadm: stopped /dev/md0 | mdadm: Unrecognised md component device - /dev/nvme1n1 | mdadm: Unrecognised md component device - /dev/nvme0n1 | Wiping disk signatures from partition /dev/nvme1n1p1 | Wiping disk signatures from partition /dev/nvme1n1p2 | Wiping disk signatures from partition /dev/nvme1n1p3 | Wiping disk signatures from /dev/nvme1n1 | Wiping disk signatures from partition /dev/nvme0n1p1 | Wiping disk signatures from partition /dev/nvme0n1p2 | Wiping disk signatures from partition /dev/nvme0n1p3 | Wiping disk signatures from /dev/nvme0n1 | mdadm: Note: this array has metadata at the start and | mdadm: size set to 468218880K | mdadm: automatically enabling write-intent bitmap on large array | Continue creating array? mdadm: Defaulting to version 1.2 metadata | mdadm: array /dev/md0 started. | Wiping ext3 signature on /dev/ngcp/root. | Wiping ext4 signature on /dev/ngcp/fallback. | Wiping ext4 signature on /dev/ngcp/data. While at it, be more verbose about the executed steps. FTR, disk and setup information of such a system where we noticed the failure and worked on this change: | root@grml ~ # fdisk -l | Disk /dev/nvme0n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors | Disk model: DELL NVME ISE PE8010 RI M.2 480GB | Units: sectors of 1 * 512 = 512 bytes | Sector size (logical/physical): 512 bytes / 512 bytes | I/O size (minimum/optimal): 512 bytes / 512 bytes | Disklabel type: gpt | Disk identifier: 5D296676-52CF-49CF-863A-6D3A3BD0604F | | Device Start End Sectors Size Type | /dev/nvme0n1p1 2048 4095 2048 1M BIOS boot | /dev/nvme0n1p2 4096 999423 995328 486M EFI System | /dev/nvme0n1p3 999424 937701375 936701952 446.7G Linux RAID | | | Disk /dev/nvme1n1: 447.13 GiB, 480103981056 bytes, 937703088 sectors | Disk model: DELL NVME ISE PE8010 RI M.2 480GB | Units: sectors of 1 * 512 = 512 bytes | Sector size (logical/physical): 512 bytes / 512 bytes | I/O size (minimum/optimal): 512 bytes / 512 bytes | Disklabel type: gpt | Disk identifier: 9AFA8ACF-D2CD-4224-BA0C-D38A6581D0F9 | | Device Start End Sectors Size Type | /dev/nvme1n1p1 2048 4095 2048 1M BIOS boot | /dev/nvme1n1p2 4096 999423 995328 486M EFI System | /dev/nvme1n1p3 999424 937701375 936701952 446.7G Linux RAID | [...] 
| | root@grml ~ # lsblk | NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS | loop0 7:0 0 428.8M 1 loop /usr/lib/live/mount/rootfs/ngcp.squashfs | /run/live/rootfs/ngcp.squashfs | nvme0n1 259:0 0 447.1G 0 disk | ├─nvme0n1p1 259:5 0 1M 0 part | ├─nvme0n1p2 259:8 0 486M 0 part | └─nvme0n1p3 259:9 0 446.7G 0 part | └─md0 9:0 0 446.5G 0 raid1 | ├─ngcp-root 253:0 0 10G 0 lvm /mnt | ├─ngcp-fallback 253:1 0 10G 0 lvm | └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data | nvme1n1 259:4 0 447.1G 0 disk | ├─nvme1n1p1 259:2 0 1M 0 part | ├─nvme1n1p2 259:6 0 486M 0 part | └─nvme1n1p3 259:7 0 446.7G 0 part | └─md0 9:0 0 446.5G 0 raid1 | ├─ngcp-root 253:0 0 10G 0 lvm /mnt | ├─ngcp-fallback 253:1 0 10G 0 lvm | └─ngcp-data 253:2 0 383.9G 0 lvm /mnt/ngcp-data | | root@grml ~ # cat /proc/mdstat | Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] | md0 : active raid1 nvme0n1p3[1] nvme1n1p3[0] | 468218880 blocks super 1.2 [2/2] [UU] | [==>..................] resync = 12.7% (59516864/468218880) finish=33.1min speed=205685K/sec | bitmap: 4/4 pages [16KB], 65536KB chunk | | unused devices: <none> Change-Id: Iaa7f49eef11ef6ad6209fe962bb8940a75a87c95 |
1 year ago |
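The resulting order of operations as a sketch; wipefs and `--zero-superblock` match the tools quoted in the logs above, while the disk and partition names are examples:

```sh
# Sketch: wipe signatures on the assembled array first, then stop it,
# then wipe every partition, and the whole disk only last.
wipefs --all /dev/md0
mdadm --stop /dev/md0
for disk in /dev/nvme0n1 /dev/nvme1n1 ; do
  mdadm --zero-superblock "${disk}" || true  # whole disk may not be a member
  for part in "${disk}"p1 "${disk}"p2 "${disk}"p3 ; do
    wipefs --all "${part}"
  done
  wipefs --all "${disk}"
done
```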
|
236cb2d1a7 |
MT#58926 Vagrant: ensure to have libxmu6 available
We get the following error message in /var/log/vboxadd-install.log, /var/log/deployment-installer-debug.log, /var/log/daemon.log + /var/log/syslog:
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
This is caused by missing libxmu6:
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient: error while loading shared libraries: libXmu.so.6: cannot open shared object file: No such file or directory
| [sipwise-lab-trunk] sipwise@spce:~$ sudo apt install libxmu6
| Reading package lists... Done
| Building dependency tree... Done
| Reading state information... Done
| The following NEW packages will be installed:
| libxmu6
| 0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
| Need to get 60.1 kB of archives.
| After this operation, 143 kB of additional disk space will be used.
| Get:1 https://debian.sipwise.com/debian bookworm/main amd64 libxmu6 amd64 2:1.1.3-3 [60.1 kB]
| Fetched 60.1 kB in 0s (199 kB/s)
| [...]
| [sipwise-lab-trunk] sipwise@spce:~$ /opt/VBoxGuestAdditions-7.0.6/bin/VBoxClient --help
| Oracle VM VirtualBox VBoxClient 7.0.6
| Copyright (C) 2005-2023 Oracle and/or its affiliates
|
| Usage: VBoxClient --clipboard|--draganddrop|--checkhostversion|--seamless|--vmsvga|--vmsvga-session
| [-d|--nodaemon]
|
| Options:
| [...]
It looks like the lack of libxmu6 doesn't cause any actual problems for our use case (we don't use X.org at all), though given that libxmu6 is a small library package, let's try to get it working as expected and avoid the alarming errors in the logs. Thanks to Guillem Jover for spotting and reporting this. Change-Id: I65f3dd496a4026f04fd9944fd7cc43d6abbdf336 |
1 year ago |
|
8c3ab6b241 |
MT#57559 Always include zstd when bootstrapping systems
During initial deployment of a system, we get warnings about lack of zstd:
| Setting up linux-image-6.1.0-13-amd64 (6.1.55-1) ...
| I: /vmlinuz.old is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img.old is now a symlink to boot/initrd.img-6.1.0-13-amd64
| I: /vmlinuz is now a symlink to boot/vmlinuz-6.1.0-13-amd64
| I: /initrd.img is now a symlink to boot/initrd.img-6.1.0-13-amd64
| /etc/kernel/postinst.d/initramfs-tools:
| update-initramfs: Generating /boot/initrd.img-6.1.0-13-amd64
| W: No zstd in /usr/bin:/sbin:/bin, using gzip
| [...]
The initramfs generation and update overall runs *four* times within the initial bootstrapping of a system (we'll try to do something about this, but that's outside the scope of this change). As of initramfs-tools v0.141, initramfs-tools uses zstd as the default compression for the initramfs. Version 0.142 is shipped with Debian/bookworm, and therefore it makes sense to have zstd available upfront. Note that the initrd generation is also faster with zstd (~10sec for zstd vs. ~13sec for gzip) and the resulting initrd is smaller (~33MB for zstd vs ~39MB for gzip). By making sure that zstd is available straight from the very beginning, before ngcp-installer pulls it in later, we not only avoid the warning message but also save >10 seconds of install time. Given that zstd is available even in Debian oldoldstable, let's install it unconditionally on all our systems. Thanks: Volodymyr Fedorov for reporting Change-Id: I56674c3c213f7c7a6e6cbce3c8e2e00a4cfbdbd4 |
1 year ago |
|
9cceb8d655 |
MT#58356 ntp: Use ntpsec.service instead of ntp.service
Even though the ntpsec.service contains an Alias for ntp.service, that does not work for us when the service has not yet been installed, so the first run will fail. Use the actual name to avoid this issue. Change-Id: I8f0ee3b38390a7e58c3bbee65fd96bfd4b717dfa |
2 years ago |
|
366c412c1f |
MT#57980 Add mr11.5 LTS key to bootstrap
Now it contains:
pub   rsa4096 2015-03-05 [SC] [expires: 2029-10-12]
      68A702B1FD8E422AAAA1ADA3773236EFF411A836
uid           [ unknown] Sipwise GmbH (Sipwise Repository Key) <support@sipwise.com>
sub   rsa4096 2015-03-05 [E] [expires: 2029-10-12]
pub   rsa4096 2011-06-06 [SC]
      F7B8A739CE638D719A078C9859104633EE5E097D
uid           [ unknown] Sipwise autobuilder (Used to sign packages for autobuild) <development@sipwise.com>
sub   rsa4096 2011-06-06 [E]
pub   rsa4096 2022-05-31 [SCEA] [expires: 2032-05-28]
      39EB73D5B54870181632E48786C3B4395CB844A2
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
pub   rsa4096 2023-08-04 [SCEA] [expires: 2033-08-01]
      F0A595D85C375447BB09F25E34A72CE4979CA98A
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
pub   rsa4096 2021-05-04 [SCEA] [expires: 2031-05-02]
      AB7FE3DCD53767F6160406442A5CA71B542B9A22
uid           [ unknown] Sipwise autobuilder <development@sipwise.com>
Change-Id: I33c8a4e666f1a7f8b64d823c3d4e2550ca8dcf11 |
2 years ago |
|
793a93bc43 |
MT#57453 vagrant_configuration: remove fake systemd presence after execution
Let's restore the system state of /run/systemd/system for
VBoxLinuxAdditions, to avoid any unexpected side effects.
Followup for git rev
|
2 years ago |
|
561303359e |
MT#57453 Use tty1 for stdin when running under grml-autoconfig service
Recent Grml ISOs, including our Grml-Sipwise ISO (v2023-06-01), include
grml-autoconfig v0.20.3, which executes the grml-autoconfig service under
`StandardInput=null`. This is necessary to avoid conflicts with tty usage,
such as with a serial console. See
|
2 years ago |
|
8601193128 |
MT#57453 vagrant_configuration: fake systemd presence
As of git rev
|
2 years ago |
|
6c960afee4 |
TT#104221 Use bookworm repos in ensure_packages_installed appropriately
Support the bookworm option in the DEBIAN_RELEASE selection; we have support for it already. Use bookworm as the fallback, since we have switched to it by now. Change-Id: I118c1b5cf81fe57394495b5f745fc81032406c78 |
2 years ago |
|
37163532ee |
MT#56773 Use bullseye puppetlabs repository for bookworm
To be able to upgrade our internal systems to Debian/bookworm we need to have puppet packages available. Upstream still doesn't provide any Debian/bookworm packages (see https://tickets.puppetlabs.com/browse/PA-4995), though their AIO (All In One) packages for Debian/bullseye seem to be working on Debian/bookworm as well (at least for puppet-agent). So until we either migrate to the puppet-agent as present in Debian/bookworm or upstream provides corresponding AIO packages, let's use the puppet-agent packages we already use for our Debian/bullseye systems. Change-Id: I2211ffd79f70a2a79873e737b0b512bfb7492328 |
2 years ago |
|
0fedba6144 |
MT#57643 Ensure /var/lib/dpkg/available exists on Debian releases <=buster
Since version 1.20.0, dpkg no longer creates /var/lib/dpkg/available (see #647911). Now that we have upgraded our Grml-Sipwise deployment system to bookworm, we have dpkg v1.21.22 on our live system, and mmdebstrap relies on the dpkg of the host system for execution. But on Debian releases up to and including buster, dpkg fails to operate with e.g. `dpkg --set-selections` if /var/lib/dpkg/available doesn't exist:
| The following NEW packages will be installed:
| nullmailer
| [...]
| debconf: delaying package configuration, since apt-utils is not installed
| dpkg: error: failed to open package info file '/var/lib/dpkg/available' for reading: No such file or directory
We *could* also switch from mmdebstrap to debootstrap for deploying Debian releases <=buster, but this would be slower, and we have been using mmdebstrap for everything for quite some time. So instead let's create /var/lib/dpkg/available after bootstrapping the system. Reported towards mmdebstrap as #1037946. Change-Id: I0a87ca255d5eb7144a9c093051c0a6a3114a3c0b |
2 years ago |
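A sketch of that post-bootstrap fixup; `TARGET` as the mount point of the freshly bootstrapped system is an assumption:

```sh
# Sketch: dpkg >= 1.20.0 no longer creates /var/lib/dpkg/available,
# but dpkg on <=buster targets still needs it, so create it once
# mmdebstrap has finished.
TARGET=/mnt
if [ ! -f "${TARGET}/var/lib/dpkg/available" ] ; then
  touch "${TARGET}/var/lib/dpkg/available"
fi
```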
|
eccdc586ae |
MT#57644 puppet/git: allow ssh-rsa pubkey usage
Now that our deployment system is based on Debian/bookworm while our gerrit/git server still runs on Debian/bullseye, we run into the OpenSSH RSA issue (RSA signatures using the SHA-1 hash algorithm got disabled by default), see https://michael-prokop.at/blog/2023/06/11/what-to-expect-from-debian-bookworm-newinbookworm/ and https://www.jhanley.com/blog/ssh-signature-algorithm-ssh-rsa-error/ We need to enable ssh-rsa usage, otherwise deployment fails with:
| Warning: Permanently added '[gerrit.mgm.sipwise.com]:29418' (ED25519) to the list of known hosts.
| sign_and_send_pubkey: no mutual signature supported
| puppet-r10k@gerrit.mgm.sipwise.com: Permission denied (publickey).
| fatal: Could not read from remote repository.
Change-Id: I5894170dab033d52a2612beea7b6f27ab06cc586 |
2 years ago |
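One way to re-enable RSA/SHA-1 signatures for just that host; the options are standard OpenSSH client settings, while the drop-in path and the host scoping are assumptions:

```sh
# Sketch: allow ssh-rsa signatures only towards the bullseye-based
# gerrit host instead of loosening the client defaults globally.
cat > /etc/ssh/ssh_config.d/gerrit-ssh-rsa.conf << 'EOF'
Host gerrit.mgm.sipwise.com
    PubkeyAcceptedAlgorithms +ssh-rsa
    HostKeyAlgorithms +ssh-rsa
EOF
```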
|
8cfb8c8392 |
MT#57630 Check online connectivity to work around Intel E810 / ice issue
Deploying the Debian/bookworm-based NGCP system fails on a Lenovo sr250 v2 node with an Intel E810 network card:
| # lshw -c net -businfo
| Bus info Device Class Description
| =======================================================
| pci@0000:01:00.0 eth0 network Ethernet Controller E810-XXV for SFP
| pci@0000:01:00.1 eth1 network Ethernet Controller E810-XXV for SFP
| # lshw -c net
| *-network:0
| description: Ethernet interface
| product: Ethernet Controller E810-XXV for SFP
| vendor: Intel Corporation
| physical id: 0
| bus info: pci@0000:01:00.0
| logical name: eth0
| version: 02
| serial: [...]
| size: 10Gbit/s
| capacity: 25Gbit/s
| width: 64 bits
| clock: 33MHz
| capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 1000bt-fd 25000bt-fd
| configuration: autonegotiation=off broadcast=yes driver=ice driverversion=1.11.14 duplex=full firmware=2.25 0x80007027 1.2934.0 ip=192.168.90.51 latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
| resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:4002000000-4003ffffff memory:4006010000-400601ffff memory:a1d00000-a1dfffff memory:4005000000-4005ffffff memory:4006220000-400641ffff
We set up the /etc/network/interfaces file by invoking Grml's netcardconfig script in automated mode, like:
NET_DEV=eth0 METHOD=static IPADDR=192.168.90.51 NETMASK=255.255.255.248 GATEWAY=192.168.90.49 /usr/sbin/netcardconfig
The resulting /etc/network/interfaces gets used as base for usage inside the NGCP chroot/target system. netcardconfig shuts down the network interface (eth0 in the example above) via ifdown, then sleeps for 3 seconds and re-enables the interface (via ifup) with the new configuration. This worked fine so far, but with the Intel E810 network card and kernel version 6.1.0-9-amd64 from Debian/bookworm we see a link failure, and it takes ~10 seconds until the network device is up and running again. The following vagrant_configuration() execution from deployment.sh then fails:
| +11:41:01 (netscript.grml:1022): vagrant_configuration(): wget -O /var/tmp/id_rsa_sipwise.pub http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| --2023-06-11 11:41:01-- http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| Resolving builder.mgm.sipwise.com (builder.mgm.sipwise.com)... failed: Name or service not known.
| wget: unable to resolve host address 'builder.mgm.sipwise.com'
However, when we retry just a bit later, the network works fine again. During investigation we identified that the network card flips the port, quoting the related log from the connected Cisco nexus 5020 switch (with fast stp learning mode):
| nexus5k %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/33 is down (Link failure)
It seems to be related to some autonegotiation problem, as when we execute `ethtool -A eth0 rx on tx on` (no matter whether with `on` or `off`), we see:
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: Autoneg did not complete so changing settings may not result in an actual change.
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: NIC Link is Down
| [Tue Jun 13 08:51:45 2023] ice 0000:01:00.0 eth0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: Rx/Tx
FTR:
| root@sp1 ~ # ethtool -A eth0 autoneg off
| netlink error: Operation not supported
| 76 root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # ethtool -A eth0 autoneg on
| root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
| Duplex: Full
| Auto-negotiation: off
| Port: FIBRE
| root@sp1 ~ # dmesg -T | tail -1
| [Tue Jun 13 08:53:26 2023] ice 0000:01:00.0 eth0: To change autoneg please use: ethtool -s <dev> autoneg <on|off>
| root@sp1 ~ # ethtool -s eth0 autoneg off
| root@sp1 ~ # ethtool -s eth0 autoneg on
| netlink error: link settings update failed
| netlink error: Operation not supported
| 75 root@sp1 ~ #
As a workaround, at least until we have a better fix/solution, we try to reach the default gateway (or fall back to the repository host if the gateway couldn't be identified) via ICMP/ping, and once that works we continue as usual. But even if that should fail, we continue execution, to minimize the behavior change while still having a workaround for this specific situation available. FTR, broken system:
| root@sp1 ~ # ethtool -i eth0
| driver: ice
| version: 6.1.0-9-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Whereas with kernel 5.10.0-23-amd64 from Debian/bullseye we don't seem to see that behavior:
| root@sp1:~# ethtool -i neth0
| driver: ice
| version: 5.10.0-23-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
Also using the latest available ice v1.11.14 (from https://sourceforge.net/projects/e1000/files/ice%20stable/1.11.14/) on kernel version 6.1.0-9-amd64 doesn't bring any change:
| root@sp1 ~ # modinfo ice
| filename: /lib/modules/6.1.0-9-amd64/updates/drivers/net/ethernet/intel/ice/ice.ko
| firmware: intel/ice/ddp/ice.pkg
| version: 1.11.14
| license: GPL v2
| description: Intel(R) Ethernet Connection E800 Series Linux Driver
| author: Intel Corporation, <linux.nics@intel.com>
| srcversion: 818E9C817731C98A25470C0
| alias: pci:v00008086d00001888sv*sd*bc*sc*i*
| [...]
| alias: pci:v00008086d00001591sv*sd*bc*sc*i*
| depends: ptp
| retpoline: Y
| name: ice
| vermagic: 6.1.0-9-amd64 SMP preempt mod_unload modversions
| parm: debug:netif level (0=none,...,16=all) (int)
| parm: fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
| (ushort)
| parm: fwlog_events:FW events to log (32-bit mask)
| (ulong)
| root@sp1 ~ # ethtool -i eth0 | head -3
| driver: ice
| version: 1.11.14
| firmware-version: 2.25 0x80007027 1.2934.0
| root@sp1 ~ #
Change-Id: Ieafe648be4e06ed0d936611ebaf8ee54266b6f3c |
2 years ago |
|
f4da3e094e |
MT#57049 Ensure SW-RAID device is inactive before re-reading partition table
Re-reading of disks fails if the mdadm SW-RAID device is still active: | root@sp1 ~ # cat /proc/mdstat | Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] | md0 : active raid1 sdb3[1] sda3[0] | 468218880 blocks super 1.2 [2/2] [UU] | [========>............] resync = 42.2% (197855168/468218880) finish=22.4min speed=200756K/sec | bitmap: 3/4 pages [12KB], 65536KB chunk | | unused devices: <none> | root@sp1 ~ # blockdev --rereadpt /dev/sdb | blockdev: ioctl error on BLKRRPART: Device or resource busy | 1 root@sp1 ~ # blockdev --rereadpt /dev/sda | blockdev: ioctl error on BLKRRPART: Device or resource busy | 1 root@sp1 ~ # Only once we stop the mdadm SW-RAID device can we re-read the partition table: | root@sp1 ~ # mdadm --stop /dev/md0 | mdadm: stopped /dev/md0 | root@sp1 ~ # blockdev --rereadpt /dev/sda | root@sp1 ~ # This behavior isn't new and is unrelated to Debian/bookworm; it was spotted while debugging a different issue. FTR: we re-read the partition table (via `blockdev --rereadpt`) to ensure that /etc/fstab of the live system is up to date and matches the current system state. While this isn't strictly needed, we preserve existing behavior and also try to avoid a hard "cut" of a possibly ongoing SW-RAID sync. Change-Id: I735b00423e6efa932f74b78a38ed023576e5d306 |
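A sketch of the required ordering, with /dev/md0 and the member disks taken from the example above:
| # stop the SW-RAID array first, otherwise BLKRRPART fails with EBUSY
| if grep -q '^md0 : active' /proc/mdstat ; then
|   mdadm --stop /dev/md0
| fi
| for disk in /dev/sda /dev/sdb ; do
|   blockdev --rereadpt "$disk"
| done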
2 years ago |
|
2ad306c465 |
MT#57556 Prompt for reboot/halt only in interactive mode
With our newer Grml-Sipwise ISO (v2023-06-01) being based on Debian/bookworm and recent Grml packages, our automated deployment suddenly started to fail: | +04:28:12 (netscript.grml:2453): echo 'Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds]' | ++04:28:12 (netscript.grml:2455): get_deploy_status | ++04:28:12 (netscript.grml:95): get_deploy_status(): '[' -r /srv/deployment//status ']' | ++04:28:12 (netscript.grml:96): get_deploy_status(): cat /srv/deployment//status | Successfully finished deployment process [Fri Jun 2 04:28:12 UTC 2023 - running 576 seconds] | +04:28:12 (netscript.grml:2455): '[' copylogfiles '!=' error ']' | +04:28:12 (netscript.grml:2456): set_deploy_status finished | +04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n finished ']' | +04:28:12 (netscript.grml:104): set_deploy_status(): echo finished | +04:28:12 (netscript.grml:2459): false | +04:28:12 (netscript.grml:2463): status_wait | +04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]] | +04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]] | +04:28:12 (netscript.grml:2466): false | +04:28:12 (netscript.grml:2471): false | +04:28:12 (netscript.grml:2476): echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)' | Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.) | +04:28:12 (netscript.grml:2477): unset a | +04:28:12 (netscript.grml:2478): read -r a | ++04:28:12 (netscript.grml:2478): wait_exit | ++04:28:12 (netscript.grml:339): wait_exit(): local e_code=1 | ++04:28:12 (netscript.grml:340): wait_exit(): [[ 1 -ne 0 ]] | ++04:28:12 (netscript.grml:341): wait_exit(): set_deploy_status error | ++04:28:12 (netscript.grml:103): set_deploy_status(): '[' -n error ']' | ++04:28:12 (netscript.grml:104): set_deploy_status(): echo error | ++04:28:12 (netscript.grml:343): wait_exit(): trap '' 1 2 3 6 15 ERR EXIT | ++04:28:12 (netscript.grml:344): wait_exit(): status_wait | ++04:28:12 (netscript.grml:329): status_wait(): [[ -n 0 ]] | ++04:28:12 (netscript.grml:329): status_wait(): [[ 0 != 0 ]] | ++04:28:12 (netscript.grml:345): wait_exit(): exit 1 As of grml-autoconfig v0.20.3 and newer, the grml-autoconfig systemd service that invokes the deployment netscript uses `StandardInput=null` instead of `StandardInput=tty` (see https://github.com/grml/grml/issues/176). This exposed a logic error in our deployment script: we exit the script early in interactive mode, and prompt for reboot/halt with `read -r a` only *afterwards* - so the prompt is only ever reached in non-interactive mode, where it of course fails since stdin is missing. As a result, we end up in our signal handler `trap 'wait_exit;' 1 2 3 6 15 ERR EXIT` and then fail the deployment. So instead prompt for "Do you want to [r]eboot or [h]alt ..." *only* in interactive mode, and while at it drop the `if "$INTERACTIVE" ; then exit 0 ; fi` so the prompt is actually presented to the user. Change-Id: Ia89beaf3c446f3701cc30ab21cfdff7b5808a6d3 |
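The resulting logic, sketched with the `$INTERACTIVE` flag and prompt text from the quoted code:
| # prompt only in interactive mode, where stdin is actually available
| if "$INTERACTIVE" ; then
|   echo 'Do you want to [r]eboot or [h]alt the system now? (Press any other key to cancel.)'
|   unset a
|   read -r a
|   case "$a" in
|     r) reboot ;;
|     h) halt ;;
|   esac
| fi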
2 years ago |
|
98d11bfc28 |
MT#57280 Run deployment status server under systemd
Manual execution of Python's http.server has multiple drawbacks, like no proper logging and no service tracking/restart options, but most notably the deployment status server no longer runs when our deployment script fails. While /srv/deployment/status might then still contain "error", no one is serving that information on port 4242 any longer[1], and our daily-build-install-vm Jenkins job might then report: | VM '192.168.209.162' current state is '' - retrying up to another 1646 times, sleeping for a second | VM '192.168.209.162' current state is '' - retrying up to another 1645 times, sleeping for a second | [...] It then runs for ~half an hour without doing anything useful, until the Jenkins job itself gives up. By running our deployment status server under systemd, we keep the service alive even when the deployment script terminates. In case of errors we get immediate feedback: | VM '192.168.209.162' current state is 'puppet' - retrying up to another 1648 times, sleeping for a second | VM '192.168.209.162' current state is 'puppet' - retrying up to another 1647 times, sleeping for a second | VM '192.168.209.162' current state is 'error' - retrying up to another 1646 times, sleeping for a second | + '[' error '!=' finished ']' | + echo 'Failed to install Proxom VM '\''162'\'' (IP '\''192.168.209.162'\'')' [1] For our NGCP based installations we use the ngcpstatus boot option, where its status_wait trap kicks in and avoids a premature exit of the deployment status server. But e.g. our non-NGCP systems don't use that boot option, and with this change we could get rid of status_wait entirely. Change-Id: Ibaa799358caedf31c64c37b48e3c5e889808086a |
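A minimal sketch of such a unit, written from the deployment script; the unit name is an assumption, while the status directory /srv/deployment and port 4242 come from the commit message:
| # hypothetical unit file keeping the status server independent of the script's lifetime
| cat > /etc/systemd/system/deployment-status.service << 'EOF'
| [Unit]
| Description=Deployment status server
| After=network.target
|
| [Service]
| ExecStart=/usr/bin/python3 -m http.server 4242 --directory /srv/deployment
| Restart=on-failure
|
| [Install]
| WantedBy=multi-user.target
| EOF
| systemctl daemon-reload && systemctl enable --now deployment-status.service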
2 years ago |
|
e6819fe674 |
MT#55944 Use ngcp-initialize-udev-rules-net to deploy 70-persistent-net.rules
Use system-tools' ngcp-initialize-udev-rules-net script to deploy /etc/udev/rules.d/70-persistent-net.rules; no need to maintain the code in multiple places. Change-Id: I81925262a8c687aa9976cbc1113568989fa53281 |
2 years ago |
|
ae7db13232 |
MT#55944 Fix networking for plain Debian systems
When building our Debian boxes for buster, bullseye + bookworm (via the daily-build-matrix-debian-boxes Jenkins job), we get broken networking, so e.g. `vagrant up debian-bookworm` doesn't work. This is caused by /etc/network/interfaces (using e.g. "neth0", the naming scheme we use in NGCP, as adjusted by the deployment script) not matching the actual system network devices (like enp0s3). TL;DR: no behavior change for NGCP systems; only when building non-NGCP systems do we enable net.ifnames=0 (via set_custom_grub_boot_options), and we do *not* generate /etc/udev/rules.d/70-persistent-net.rules (via generate_udev_network_rules) nor rename eth*->neth* in /etc/network/interfaces. More verbose version: * rename the "eth*" networking interfaces into "neth*" in /etc/network/interfaces only when running in ngcp-installer mode (this is the behavior we rely on in NGCP, but it doesn't matter for plain Debian systems) * generate /etc/udev/rules.d/70-persistent-net.rules only when running in ngcp-installer mode. While our jenkins-configs.git's jobs/daily-build/scripts/vm_clean-fs.sh removes the file anyway (for the VM use case), between the initial deployment run and the next reboot the configuration inside the PVE VM still applies, so we end up with an existing /etc/udev/rules.d/70-persistent-net.rules, referring to neth0, while our /etc/network/interfaces configures eth0 instead. * when *not* running in ngcp-installer mode, enable net.ifnames=0 usage in GRUB to disable persistent network interface naming. FTR, this change is *not* needed for NGCP, as on NGCP systems we use /etc/udev/rules.d/70-persistent-net.rules, generated by ngcp-system-tools' ngcp-initialize-udev-rules-net script, also in the VM use case. This is a fixup for a change in git commit |
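A sketch of the resulting conditional; NGCP_INSTALLER and TARGET are illustrative names, while the two helper functions are named in the commit message:
| if "$NGCP_INSTALLER" ; then
|   # NGCP systems: rename eth* -> neth* and ship persistent udev rules
|   sed -i 's/\beth\([0-9]\+\)\b/neth\1/g' "${TARGET}/etc/network/interfaces"
|   generate_udev_network_rules
| else
|   # plain Debian systems: keep eth* names via net.ifnames=0 in GRUB
|   set_custom_grub_boot_options
| fi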
2 years ago |
|
6412814e6b |
MT#55949 Ensure we have proper date/time configuration
If the date/time of the running system isn't accurate, then apt runs might fail with something like: | E: Release file for https://deb.sipwise.com/spce/mr10.5.2/dists/bullseye/InRelease is not valid yet (invalid for another 6h 19min 2s) So let's try to sync the system's date/time via NTP. Given that chrony is a small (only 650 kB of disk space) and secure replacement for ntp, let's ship chrony with the Grml deployment ISO (and fall back to ntp in the deployment script if chrony isn't available). Also, if the system is configured to read the RTC time in the local time zone, this is a known source of further problems, so let's make sure to use the RTC in UTC. Change-Id: I747665d1cee3b6f835c62812157d0203bcfa96e2 |
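A sketch of the fallback logic; pool.ntp.org stands in for whatever NTP server is actually configured:
| # one-shot time sync, preferring chrony over ntp
| if command -v chronyd >/dev/null 2>&1 ; then
|   chronyd -q 'pool pool.ntp.org iburst'
| else
|   ntpdate pool.ntp.org
| fi
| hwclock --systohc --utc   # store system time into the RTC in UTC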
2 years ago |
|
245c7ef702 |
MT#55861 Update Grml ISO + update to Debian/bookworm
For deploying Debian/bookworm (see MT#55524), we'd like to have an updated Grml ISO. With such a Debian/bookworm based live system, we can still deploy older target systems (like Debian/bullseye). Relevant changes: 1) Add jo as a new build-dependency, to generate build information in conf/buildinfo.json (a new dependency of grml-live) 2) Always include ca-certificates, as this is required with more recent mmdebstrap versions (>=0.8.0) when using apt repositories with https; otherwise bootstrapping Debian fails. 3) Update to the latest stable grml-live version v0.42.0, which: a) added support for "bookworm" as suite name |
2 years ago |
|
ad9e94efb6 |
MT#55861 Load the fake-uname.so pre-loaded library from within the chroot
We build the pre-loaded library targeting a specific Debian release, which might be different from (and newer than) the release Grml was built for. This can cause missing versioned symbols (and a loading failure) if the libc in the outer system is older than the one in the inner system. Change-Id: I84f4f307863e534fe0fff85274ae1d5db809012c |
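A sketch of the idea: copy the library into the target and reference it by its in-chroot path, so the dynamic linker resolves its symbols against the inner glibc (the file names and the UTS_RELEASE variable are illustrative):
| cp fake-uname.so "${TARGET}/tmp/fake-uname.so"
| chroot "$TARGET" env LD_PRELOAD=/tmp/fake-uname.so UTS_RELEASE=6.1.0-9-amd64 uname -r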
2 years ago |
|
d1d0e61512 |
MT#55379 Use usrmerge for Debian/bookworm based systems
The transition to usrmerge has started in Debian, see https://lists.debian.org/debian-devel-announce/2022/09/msg00001.html Debian/bookworm AKA v12 will only support the merged-/usr layout. Systemd is also dropping support for unmerged-usr systems (see https://lists.freedesktop.org/archives/systemd-devel/2022-September/048352.html). Deploy the expected filesystem layout accordingly, as in: 1) no-merged-usr for Debian releases up to and including bullseye, and 2) merged-usr for bookworm and newer Change-Id: I7b7b294ce12ca245cf978a787bcc20aa9753e73d |
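Sketched here with plain debootstrap options (the actual deployment goes through grml-debootstrap, so this is only illustrative; variable names are assumptions):
| case "$DEBIAN_RELEASE" in
|   bookworm|trixie|sid) USR_OPT='--merged-usr'    ;;  # bookworm and newer
|   *)                   USR_OPT='--no-merged-usr' ;;  # bullseye and older
| esac
| debootstrap "$USR_OPT" "$DEBIAN_RELEASE" "$TARGET" "$MIRROR"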
3 years ago |
|
b372471a20 |
TT#15305 Fix ngcp-deployment-scripts usage for daily-build-matrix-debian-boxes
Git commit
|
3 years ago |
|
1d4f08b7ed |
TT#15305 development.sh: support trunk-weekly, take two
Change-Id: I83e635dc5916833d0699fd0be5a8a742ef7b40c8 |
3 years ago |
|
6661b04af0 |
TT#15305 deployment.sh: support trunk-weekly
Change-Id: Ie98ac5fa0de848cf54a96039af5532eb8012bab9 |
3 years ago |
|
c177a98100 |
TT#179354 Add mr10.5 LTS key to bootstrap
Now it contains: | pub rsa4096 2015-03-05 [SC] [expires: 2029-10-12] | 68A702B1FD8E422AAAA1ADA3773236EFF411A836 | uid [ unknown] Sipwise GmbH (Sipwise Repository Key) <support@sipwise.com> | sub rsa4096 2015-03-05 [E] [expires: 2029-10-12] | pub rsa4096 2011-06-06 [SC] | F7B8A739CE638D719A078C9859104633EE5E097D | uid [ unknown] Sipwise autobuilder (Used to sign packages for autobuild) <development@sipwise.com> | sub rsa4096 2011-06-06 [E] | pub rsa4096 2021-05-04 [SCEA] [expires: 2031-05-02] | AB7FE3DCD53767F6160406442A5CA71B542B9A22 | uid [ unknown] Sipwise autobuilder <development@sipwise.com> | pub rsa4096 2022-05-31 [SCEA] [expires: 2032-05-28] | 39EB73D5B54870181632E48786C3B4395CB844A2 | uid [ unknown] Sipwise autobuilder <development@sipwise.com> Change-Id: Ic851724f3580a4f6addbba41b42d97c02acf4ff2 |
3 years ago |
|
8e063362ef |
TT#173500 Create tmpfiles with template name
We want to be able to track down any left-behind tmp files, so ensure we create them with recognizable, template-based file names. Change-Id: I4eb44047f2eb86ba9f0a8aeeb8d6555290f60c00 |
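For example, with mktemp's template support (the file name pattern is illustrative):
| # tmp file carrying the script name, so leftovers are easy to attribute
| TMPFILE=$(mktemp "/tmp/deployment-netscript.XXXXXXXX")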
3 years ago |
|
15aaad8edb |
TT#161150 Replace ngcpsp* with ngcpnodename option
This is needed to support arbitrary spN nodes. Sort options in deployment.sh. Remove the unused boot options ngcpnonwrecfg and ngcpfillcache. Change-Id: I300e533c15b71d65e768ca2ed4b3a73eb7ec6954 |
3 years ago |
|
be237917d7 |
TT#161150 Refactor options parsing
Merge all options parsing into a single place. Move options parsing to the top of the script. Parse boot options first, then cmd options if they exist. Simplify some checks. Remove unused options. Change-Id: Ibcb099d9bb2ba26ffed9904c8e5065b392ecb78a |
3 years ago |
|
f27f51c6c8 |
TT#165600 Add support for NVMe disks
The logic to detect disks via /proc/partitions didn't cover NVMe disks, as the regex '[a-z]$' fails for the "nvme0n1" pattern: | % cat /proc/partitions | major minor #blocks name | | 259 0 500107608 nvme0n1 | 259 1 524288 nvme0n1p1 | 259 2 499582279 nvme0n1p2 | [...] | 8 0 384638976 sda | 8 1 384606208 sda1 Instead, let's use lsblk to detect present disks, which works fine for all kinds of disks, incl. NVMe devices. Change-Id: I586877da8b4fadf3d05b4e6c8e88bfdeae6d7f15 |
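A sketch of the lsblk-based detection:
| # list whole disks only, covering sdX and nvmeXnY devices alike
| lsblk --nodeps --noheadings --output NAME,TYPE | awk '$2 == "disk" { print $1 }'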
3 years ago |
|
a99d9ff6e2 |
TT#161150 Refactoring default values and parameter parsing
Sort default values. Rework cmd parameter parsing: remove some reassignments, reformat for clarity, etc. Add some default options: CROLE, EADDR, EXTERNAL_NETMASK, ROLE. Change-Id: I287facafeb53dc5390517424935c8a50932246dc |
3 years ago |
|
7b53916c30 |
TT#157450 Add extra logging entries and copy logs later
Add extra deployment statuses for grub-install and try to log more data. Change-Id: Id06dfad1264f781157631c51035ab219cfc30070 |
3 years ago |
|
3073c27a40 |
TT#118659 EFI support: ensure to always have a proper FAT filesystem available
If grml-debootstrap detects an existing FAT filesystem on the EFI partition, it doesn't modify/re-create it: | EFI partition /dev/nvme0n1p2 seems to have a FAT filesystem, not modifying. The underlying check is execution of `fsck.vfat -bn $DEVICE`. With fsck.fat from dosfstools v4.1-2 as present in Debian/buster we got: | root@grml ~ # fsck.vfat -bn /dev/nvme0n1p2 | fsck.fat 4.1 (2017-01-24) | 0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt. | Automatically removing dirty bit. | There are differences between boot sector and its backup. | This is mostly harmless. Differences: (offset:original/backup) | 0:00/eb, 82:00/46, 83:00/41, 84:00/54, 85:00/33, 86:00/32, 87:00/20 | , 88:00/20, 89:00/20, 510:00/55, 511:00/aa | Not automatically fixing this. | Leaving filesystem unchanged. | 1 root@grml ~ # Whereas with dosfstools v4.2-1 as present in Debian/bullseye, this might become: | root@grml ~ # fsck.vfat -bn /dev/nvme0n1p2 | fsck.fat 4.2 (2021-01-31) | There are differences between boot sector and its backup. | This is mostly harmless. Differences: (offset:original/backup) | 0:00/eb, 65:01/00, 82:00/46, 83:00/41, 84:00/54, 85:00/33, 86:00/32 | , 87:00/20, 88:00/20, 89:00/20, 510:00/55, 511:00/aa | Not automatically fixing this. Note that unlike v4.1, the v4.2 check exits successfully despite the reported differences, so grml-debootstrap keeps the (possibly broken) filesystem. In such situations we end up with an incomplete/broken EFI partition, which breaks within our efivarfs post-script: | Mounting /dev/nvme0n1p2 on /boot/efi | mount: /boot/efi: wrong fs type, bad option, bad superblock on /dev/nvme0n1p2, missing codepage or helper program, or other error. | -> Failed (rc=1) | * Removing chroot-script again | * Executing post-script /etc/debootstrap/post-scripts//efivarfs | Executing /etc/debootstrap/post-scripts//efivarfs | Mounting /dev (via bind mount) | Mounting /boot/efi | mount: /boot/efi: special device UUID= does not exist. Change-Id: I46939b4e191982a84792f3aca27c6cc415dbdaf4 |
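One possible safeguard, sketched with the device name from the example above: since the fsck.vfat exit code differs between dosfstools versions, act on the reported boot sector differences instead of trusting the exit code alone:
| # grep the verification output rather than relying on the exit code (illustrative)
| if fsck.vfat -bn /dev/nvme0n1p2 2>&1 | grep -q 'differences between boot sector and its backup' ; then
|   mkfs.vfat -F 32 /dev/nvme0n1p2
| fi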
4 years ago |