MT#57630 Check online connectivity to work around Intel E810 / ice issue

Deploying the Debian/bookworm based NGCP system fails on a Lenovo sr250
v2 node with an Intel E810 network card:

| # lshw -c net -businfo
| Bus info          Device     Class          Description
| =======================================================
| pci@0000:01:00.0  eth0       network        Ethernet Controller E810-XXV for SFP
| pci@0000:01:00.1  eth1       network        Ethernet Controller E810-XXV for SFP
| # lshw -c net
|     *-network:0
| 	 description: Ethernet interface
| 	 product: Ethernet Controller E810-XXV for SFP
| 	 vendor: Intel Corporation
| 	 physical id: 0
| 	 bus info: pci@0000:01:00.0
| 	 logical name: eth0
| 	 version: 02
| 	 serial: [...]
| 	 size: 10Gbit/s
| 	 capacity: 25Gbit/s
| 	 width: 64 bits
| 	 clock: 33MHz
| 	 capabilities: pm msi msix pciexpress vpd bus_master cap_list rom ethernet physical fibre 1000bt-fd 25000bt-fd
| 	 configuration: autonegotiation=off broadcast=yes driver=ice driverversion=1.11.14 duplex=full firmware=2.25 0x80007027 1.2934.0 ip=192.168.90.51 latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
| 	 resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:4002000000-4003ffffff memory:4006010000-400601ffff memory:a1d00000-a1dfffff memory:4005000000-4005ffffff memory:4006220000-400641ffff

We set up the /etc/network/interfaces file by invoking Grml's
netcardconfig script in automated mode, like:

  NET_DEV=eth0 METHOD=static IPADDR=192.168.90.51 NETMASK=255.255.255.248 GATEWAY=192.168.90.49 /usr/sbin/netcardconfig

The resulting /etc/network/interfaces serves as the base for the
configuration inside the NGCP chroot/target system. netcardconfig shuts
down the network interface (eth0 in the example above) via ifdown,
sleeps for 3 seconds and then re-enables the interface (via ifup) with
the new configuration.

This used to work fine, but with the Intel E810 network card and
kernel version 6.1.0-9-amd64 from Debian/bookworm we see a link failure,
and it takes ~10 seconds until the network device is up and running
again. The subsequent vagrant_configuration() execution from
deployment.sh then fails:

| +11:41:01 (netscript.grml:1022): vagrant_configuration(): wget -O /var/tmp/id_rsa_sipwise.pub http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| --2023-06-11 11:41:01-- http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
| Resolving builder.mgm.sipwise.com (builder.mgm.sipwise.com)... failed: Name or service not known.
| wget: unable to resolve host address 'builder.mgm.sipwise.com'

However, when we retry just a bit later, the network works fine
again. During investigation we identified that the network card flaps
the port, quoting the related log from the connected Cisco Nexus 5020
switch (with fast STP learning mode):

| nexus5k %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/33 is down (Link failure)
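
The "works again a bit later" observation can be reproduced with a
small retry wrapper. The following is only a sketch: retry_cmd is a
hypothetical helper and not part of deployment.sh, and the attempt count
and delay are arbitrary assumptions.

```shell
# Hypothetical helper (not part of deployment.sh): run a command until it
# succeeds or the given number of attempts is exhausted.
# Usage: retry_cmd <attempts> <delay-in-seconds> <command> [args...]
retry_cmd() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1

  while ! "$@" ; do
    if [ "${n}" -ge "${attempts}" ] ; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n+1))
    sleep "${delay}"
  done

  return 0
}

# Example (not executed here):
#   retry_cmd 15 1 wget -O /var/tmp/id_rsa_sipwise.pub \
#     http://builder.mgm.sipwise.com/vagrant-ngcp/id_rsa_sipwise.pub
```

Wrapping every network call like this would change behavior in many
places, which is one reason a single connectivity check after the
interface reconfiguration is the less invasive option.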

It seems to be related to some autonegotiation problem, as when we
execute `ethtool -A eth0 rx on tx on` (no matter whether with `on` or
`off`), we see:

| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: Autoneg did not complete so changing settings may not result in an actual change.
| [Tue Jun 13 08:51:37 2023] ice 0000:01:00.0 eth0: NIC Link is Down
| [Tue Jun 13 08:51:45 2023] ice 0000:01:00.0 eth0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: Rx/Tx
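
The ~8 seconds between "NIC Link is Down" and "NIC Link is up" can also
be observed from userspace by polling the interface's carrier state. A
minimal sketch, assuming the usual sysfs layout
(/sys/class/net/<dev>/carrier reads 1 while the link is up);
wait_for_carrier is a hypothetical helper and not part of deployment.sh:

```shell
# Hypothetical helper: poll a sysfs-style carrier file until it reads "1".
# Reading the file can fail while the interface is administratively down,
# so read errors are treated the same as "no carrier".
# Usage: wait_for_carrier <carrier-file> [tries] [delay-in-seconds]
wait_for_carrier() {
  local carrier_file="$1" tries="${2:-15}" delay="${3:-1}"
  local state

  while [ "${tries}" -gt 0 ] ; do
    state="$(cat "${carrier_file}" 2>/dev/null || true)"
    if [ "${state}" = "1" ] ; then
      return 0
    fi
    tries=$((tries-1))
    sleep "${delay}"
  done

  return 1
}

# Example (not executed here):
#   wait_for_carrier /sys/class/net/eth0/carrier 15 1
```

Note that a carrier of 1 only says the link is up. It does not guarantee
that the switch port already forwards traffic, so an end-to-end check
such as ping is the more reliable signal.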

FTR:

| root@sp1 ~ # ethtool -A eth0 autoneg off
| netlink error: Operation not supported
| 76 root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
|         Duplex: Full
|         Auto-negotiation: off
|         Port: FIBRE
| root@sp1 ~ # ethtool -A eth0 autoneg on
| root@sp1 ~ # ethtool eth0 | grep -C1 Auto-negotiation
|         Duplex: Full
|         Auto-negotiation: off
|         Port: FIBRE
| root@sp1 ~ # dmesg -T | tail -1
| [Tue Jun 13 08:53:26 2023] ice 0000:01:00.0 eth0: To change autoneg please use: ethtool -s <dev> autoneg <on|off>
| root@sp1 ~ # ethtool -s eth0 autoneg off
| root@sp1 ~ # ethtool -s eth0 autoneg on
| netlink error: link settings update failed
| netlink error: Operation not supported
| 75 root@sp1 ~ #

As a workaround, at least until we have a better fix/solution, we try to
reach the default gateway (or fall back to the repository host if the
gateway couldn't be identified) via ICMP/ping, and once that works we
continue as usual. Even if the ping keeps failing we continue execution
after the retries are exhausted. This keeps the behavior change minimal
while still providing a workaround for this specific situation.
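
The gateway-or-fallback selection described above can be sketched as
follows. This is an illustration only and not the code of this change:
it parses `ip -4 route show default` output from iproute2 instead of
`route -n`, and parse_default_gateway and choose_ping_host are
hypothetical helper names.

```shell
# Hypothetical helper: extract the first default gateway from
# `ip -4 route show default` style output read on stdin.
parse_default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Hypothetical helper: print the host to ping. Uses the default gateway
# found on stdin, or the given fallback host if there is none.
choose_ping_host() {
  local fallback="$1" gateway

  gateway="$(parse_default_gateway)"
  if [ -n "${gateway}" ] ; then
    echo "${gateway}"
  else
    echo "${fallback}"
  fi
}

# Example (not executed here):
#   ip -4 route show default | choose_ping_host "${SIPWISE_REPO_HOST:-deb.sipwise.com}"
```

Taking only the first match avoids ending up with a multi-line host
value when more than one default route is configured.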

FTR, broken system:

| root@sp1 ~ # ethtool -i eth0
| driver: ice
| version: 6.1.0-9-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]

Whereas with kernel 5.10.0-23-amd64 from Debian/bullseye we don't seem
to see that behavior:

| root@sp1:~# ethtool -i neth0
| driver: ice
| version: 5.10.0-23-amd64
| firmware-version: 2.25 0x80007027 1.2934.0
| [...]

Using the latest available ice driver v1.11.14 (from
https://sourceforge.net/projects/e1000/files/ice%20stable/1.11.14/)
on kernel version 6.1.0-9-amd64 doesn't bring any change either:

| root@sp1 ~ # modinfo ice
| filename:       /lib/modules/6.1.0-9-amd64/updates/drivers/net/ethernet/intel/ice/ice.ko
| firmware:       intel/ice/ddp/ice.pkg
| version:        1.11.14
| license:        GPL v2
| description:    Intel(R) Ethernet Connection E800 Series Linux Driver
| author:         Intel Corporation, <linux.nics@intel.com>
| srcversion:     818E9C817731C98A25470C0
| alias:          pci:v00008086d00001888sv*sd*bc*sc*i*
| [...]
| alias:          pci:v00008086d00001591sv*sd*bc*sc*i*
| depends:        ptp
| retpoline:      Y
| name:           ice
| vermagic:       6.1.0-9-amd64 SMP preempt mod_unload modversions
| parm:           debug:netif level (0=none,...,16=all) (int)
| parm:           fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
|  (ushort)
| parm:           fwlog_events:FW events to log (32-bit mask)
|  (ulong)
| root@sp1 ~ # ethtool -i eth0 | head -3
| driver: ice
| version: 1.11.14
| firmware-version: 2.25 0x80007027 1.2934.0
| root@sp1 ~ #

Change-Id: Ieafe648be4e06ed0d936611ebaf8ee54266b6f3c
mr11.4
Michael Prokop 2 years ago
parent f4da3e094e
commit 8cfb8c8392

@@ -1324,6 +1324,35 @@ set_custom_grub_boot_options() {
fi
}
get_ping_host() {
  local route
  # gateway column of the default route (destination 0.0.0.0)
  route="$(route -n | awk '/^0\.0\.0\.0/{print $2}')"
  if [ -n "${route:-}" ] ; then
    ping_host="${route}"
    echo "Default route identified, using host ${ping_host}"
  else
    ping_host="${SIPWISE_REPO_HOST:-deb.sipwise.com}"
    echo "Default route could not be identified, using host ${ping_host} instead"
  fi
}

wait_for_network_online() {
  local tries="${1:-30}"
  echo "Trying to reach host ${ping_host} via ICMP/ping to check connectivity"
  while ! ping -O -D -c 1 -i 1 -W 1 "${ping_host}" ; do
    if [ "${tries}" -gt 0 ] ; then
      tries=$((tries-1))
      echo "Retrying ping to ${ping_host} (${tries} tries left)..."
      sleep 1
    else
      # do not abort the deployment, this is only a best-effort check
      echo "WARN: couldn't reach host ${ping_host} via ICMP/ping, continuing anyway"
      break
    fi
  done
}
# Main script
@@ -2247,6 +2276,9 @@ EOT
if grml-chroot "${TARGET}" /bin/bash /tmp/ngcp-installer-deployment.sh ; then
echo "ngcp-installer finished successfully"
echo "Trying to identify ping_host"
get_ping_host
# Check the current method of external interface
# If it is manual - we need to reconfigure /e/n/i to get working network configuration after the reboot
method=$( sed -rn "s/^iface ${INSTALL_DEV} inet ([A-Za-z]+)/\1/p" < /etc/network/interfaces )
@@ -2273,6 +2305,9 @@ EOT
die "Error during installation of ngcp. Find details at: ${TARGET}/var/log/ngcp-installer.log"
fi
echo "Checking for network connectivity (workaround for e.g. ice network driver issue)"
wait_for_network_online 15
echo "Generating udev persistent net rules ..."
grml-chroot "${TARGET}" /usr/sbin/ngcp-initialize-udev-rules-net
