Deploying the Debian/bookworm based NGCP system fails on a Lenovo sr250 v2 node with an Intel E810 network card:

| # lshw -c net -businfo
| Bus info          Device      Class          Description
| =======================================================
| pci@0000:01:00.0  eth0        network        Ethernet Controller E810-XXV for SFP
| pci@0000:01:00.1  eth1        network        Ethernet Controller E810-XXV for SFP

| # lshw -c net
|   *-network:0
|        description: Ethernet interface
|        product: Ethernet Controller E810-XXV for SFP
|        vendor: Intel Corporation
|        physical id: 0
|        bus info: pci@0000:01:00.0
|        logical name: eth0
|        version: 02
|        serial: [...]
|        size: 10Gbit/s
|        capacity: 25Gbit/s
|        width: 64 bits
|        clock: 33MHz
|        capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 1000bt-fd 25000bt-fd
|        configuration: autonegotiation=off broadcast=yes driver=ice driverversion=1.11.14 duplex=full firmware=2.25 0x80007027 1.2934.0 ip=192.168.90.51 latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
|        resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:4002000000-4003ffffff memory:4006010000-400601ffff memory:a1d00000-a1dfffff memory:4005000000-4005ffffff memory:4006220000-400641ffff

We set up the /etc/network/interfaces file by invoking Grml's netcardconfig script in automated mode, like:

  NET_DEV=eth0 METHOD=static IPADDR=192.168.90.51 NETMASK=255.255.255.248 GATEWAY=192.168.90.49 /usr/sbin/netcardconfig

The resulting /etc/network/interfaces is then used as the base for the NGCP chroot/target system.

netcardconfig shuts down the network interface (eth0 in the example above) via ifdown, then sleeps for 3 seconds and re-enables the interface (via ifup) with the new configuration.

This has worked fine so far, but with the Intel E810 network card and kernel version 6.1.0-9-amd64 from Debian/bookworm we see a link failure, and it takes ~10 seconds until the network device is up and running again.
The following vagrant_configuration() execution from deployment.sh then fails:

| +11:41:01 (netscript.grml:1022): vagrant_configuration(): wget -O /var/tmp/id_rsa_sipwise.pub http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| --2023-06-11 11:41:01--  http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| Resolving builder.mgm.sipwise.com (builder.mgm.sipwise.com)... failed: Name or service not known.
| wget: unable to resolve host address 'builder.mgm.sipwise.com'

However, when we retry just a bit later, the network works fine again.

During investigation we identified that the network card flaps the port, quoting the related log from the connected Cisco Nexus 5020 switch (with fast stp learning mode):

| nexus5k %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/33 is down (Link failure)

It seems to be related to some autonegotiation problem, as when we execute `ethtool -A eth0 rx on tx on` (no matter whether with `on` or `off`), we see:

| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: Autoneg did not complete so changing settings may not result in an actual change.
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: NIC Link is Down
| [Tue Jun 13 08:51:45 2023] ice 0000:01:00.0 eth0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: Rx/Tx

FTR:

| root@sp1 ~ # ethtool -A eth0 autoneg off
| netlink error: Operation not supported
| 76 root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
|         Duplex: Full
|         Auto-negotiation: off
|         Port: FIBRE
| root@sp1 ~ # ethtool -A eth0 autoneg on
| root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
|         Duplex: Full
|         Auto-negotiation: off
|         Port: FIBRE
| root@sp1 ~ # dmesg -T | tail -1
| [Tue Jun 13 08:53:26 2023] ice 0000:01:00.0 eth0: To change autoneg please use: ethtool -s <dev> autoneg <on|off>
| root@sp1 ~ # ethtool -s eth0 autoneg off
| root@sp1 ~ # ethtool -s eth0 autoneg on
| netlink error: link settings update failed
| netlink error: Operation not supported
| 75 root@sp1 ~ #

As a workaround, at least until we have a better fix/solution, we try to reach the default gateway (or fall back to the repository host if the gateway couldn't be identified) via ICMP/ping, and once that works we continue as usual. Even if that fails we continue execution, to minimize behavior change while still having a workaround available for this specific situation.

FTR, broken system:

| root@sp1 ~ # ethtool -i eth0
| driver: ice
| version: 6.1.0-9-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]

Whereas with kernel 5.10.0-23-amd64 from Debian/bullseye we don't seem to see that behavior:

| root@sp1:~# ethtool -i neth0
| driver: ice
| version: 5.10.0-23-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]
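
The ping-based workaround described above could look roughly like the following sketch. It is not the actual deployment.sh code: the function names default_gateway and wait_for_network, the default of 30 tries and the one-second interval are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch of the workaround, not the real deployment.sh code.

# Print the default gateway as reported by iproute2, or nothing if unknown.
default_gateway() {
  # Expected format: "default via <gateway> dev <iface> ..."
  # Word splitting of the command output is intentional here.
  set -- $(ip route show default 2>/dev/null)
  if [ "$1" = "default" ] && [ "$2" = "via" ]; then
    echo "$3"
  fi
}

# wait_for_network <fallback_host> [max_tries]
# Ping the default gateway (or the fallback host if no gateway was found)
# until it answers. Always returns 0, so the caller continues either way.
wait_for_network() {
  fallback="$1"
  tries="${2:-30}"   # assumption: 30 tries, roughly one per second
  target="$(default_gateway)"
  [ -n "$target" ] || target="$fallback"

  i=0
  while [ "$i" -lt "$tries" ]; do
    if ping -c 1 -W 1 "$target" >/dev/null 2>&1; then
      echo "network up: $target reachable after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done

  echo "warning: $target unreachable after $tries tries, continuing anyway" >&2
  return 0
}
```

Called as `wait_for_network builder.mgm.sipwise.com` right after netcardconfig returns, this would block only while the link is still coming back up and would never abort the deployment on its own.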

Also using latest available ice v1.11.14 (from https://sourceforge.net/projects/e1000/files/ice%20stable/1.11.14/) on Kernel version 6.1.0-9-amd64 doesn't bring any change:

| root@sp1 ~ # modinfo ice
| filename:       /lib/modules/6.1.0-9-amd64/updates/drivers/net/ethernet/intel/ice/ice.ko
| firmware:       intel/ice/ddp/ice.pkg
| version:        1.11.14
| license:        GPL v2
| description:    Intel(R) Ethernet Connection E800 Series Linux Driver
| author:         Intel Corporation, <linux.nics@intel.com>
| srcversion:     818E9C817731C98A25470C0
| alias:          pci:v00008086d00001888sv*sd*bc*sc*i*
| [...]
| alias:          pci:v00008086d00001591sv*sd*bc*sc*i*
| depends:        ptp
| retpoline:      Y
| name:           ice
| vermagic:       6.1.0-9-amd64 SMP preempt mod_unload modversions
| parm:           debug:netif level (0=none,...,16=all) (int)
| parm:           fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
|  (ushort)
| parm:           fwlog_events:FW events to log (32-bit mask)
|  (ulong)
| root@sp1 ~ # ethtool -i eth0 | head -3
| driver: ice
| version: 1.11.14
| firmware-version: 2.25 0x80007027 1.2934.0
| root@sp1 ~ #

Change-Id: Ieafe648be4e06ed0d936611ebaf8ee54266b6f3c

mr11.4
parent f4da3e094e
commit 8cfb8c8392