6 Replies Latest reply on Jan 29, 2018 12:27 AM by Intel Corporation

    Failure with IGB multicast group traffic after several days

    TomSalmon

      Setup:

      Debian 8 (Jessie)

      Kernel: Linux 3.16.0-4-686-pae

      IGB Driver: 5.3.5.12 (the same problem occurs with the original driver from Debian, 5.0.5-k)

       

      After several days (anywhere from 2 to 20), our systems stop receiving multicast packets for certain groups they have joined. Each system has two interfaces, an IGB and an e1000e, which are bonded together (Bonding Mode: fault-tolerance (active-backup)). The problem only ever occurs when the IGB is the active interface.

       

      The switch to which the IGB interface is connected is sending the relevant multicast packets to the interface; this has been verified using port mirroring on the switch. However, if I run tcpdump in non-promiscuous mode, I see no incoming packets for that group, although I do see the outgoing IGMP Group Report.

      Running tcpdump in promiscuous mode resets the interface, which immediately fixes the problem, and the traffic becomes visible again.
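Because a promiscuous capture resets the interface and hides the fault, a plain UDP socket that joins the group may be a less intrusive way to notice when reception stops. The following is only a minimal sketch: the UDP port is not stated in this post, so `PORT` is a placeholder, and the `probe` and `build_mreq` helpers are hypothetical names for illustration.

```python
import socket

GROUP = "239.255.10.10"   # one of the affected groups from this post
PORT = 5004               # ASSUMPTION: placeholder, the real port is not given here


def build_mreq(group, iface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure: group address + local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def probe(group, port, iface_ip="0.0.0.0", timeout=5.0):
    """Join the group and wait for one datagram.

    Returns the datagram length, or None if nothing arrived within the timeout.
    The socket is bound to the port only, so any datagram sent to that port
    (not just traffic for this group) would count as a hit.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        build_mreq(group, iface_ip))
        sock.settimeout(timeout)
        try:
            data, _sender = sock.recvfrom(65535)
            return len(data)
        except socket.timeout:
            return None
    finally:
        sock.close()


if __name__ == "__main__":
    size = probe(GROUP, PORT)
    if size is None:
        print("no traffic received for", GROUP)
    else:
        print("received", size, "bytes for", GROUP)
```

Run periodically, a probe like this would report the moment the group goes silent without putting the interface into promiscuous mode.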

       

      Netstat always reports active membership of the intended groups:

      root@vlab-210-03:~# netstat -gn
      IPv6/IPv4 Group Memberships
      Interface       RefCnt Group
      --------------- ------ ---------------------
      lo              1      224.0.0.1
      eth1            1      224.0.0.1
      bond0           1      224.0.0.251
      bond0           1      224.0.0.1
      bond0           1      239.255.10.10
      bond0           1      239.255.10.11
      bond0           4      239.110.92.1
      bond0           2      224.0.0.107
      bond0           3      224.0.1.129
      eth0            1      224.0.0.1
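The membership check above could also be scripted, for example to log it alongside other diagnostics. This is a rough sketch that parses text in the `netstat -gn` layout shown in this post; the function names are hypothetical and the sample is a shortened copy of the output above.

```python
SAMPLE = """\
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
bond0           1      239.255.10.10
bond0           1      239.255.10.11
bond0           4      239.110.92.1
eth0            1      224.0.0.1
"""


def parse_group_memberships(text):
    """Return {interface: {group: refcount}} from `netstat -gn` style text."""
    memberships = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns with a numeric RefCnt;
        # the title, column header, and separator rows are skipped.
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        iface, refcnt, group = fields
        memberships.setdefault(iface, {})[group] = int(refcnt)
    return memberships


def missing_groups(memberships, iface, expected):
    """List the expected groups that the interface has not joined."""
    joined = memberships.get(iface, {})
    return [group for group in expected if group not in joined]


if __name__ == "__main__":
    table = parse_group_memberships(SAMPLE)
    print(missing_groups(table, "bond0", ["239.255.10.10", "239.255.10.11"]))
```

In the failed state described here the list would be empty, since the kernel still reports both memberships even though no packets arrive.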

       

      Only the groups 239.255.10.10 and 239.255.10.11 ever fail, and only when the IGB interface is the active one. Other multicast traffic functions normally.

      The two affected groups are carrying a large volume of video traffic.
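For anyone comparing against the link-layer multicast list (for example the output of `ip maddr show`), the Ethernet addresses for the two groups follow from the standard IPv4 mapping: the fixed prefix 01:00:5e plus the low 23 bits of the group address. Whether the fault is actually related to the NIC's multicast filter is an assumption, not something established in this post; the sketch below only does the address arithmetic, and `multicast_mac` is a hypothetical helper name.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF          # keep only the low 23 bits
    mac = (0x01005E << 24) | low23        # prepend the fixed 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


if __name__ == "__main__":
    for group in ("239.255.10.10", "239.255.10.11"):
        print(group, "->", multicast_mac(group))
```

For the two affected groups this gives 01:00:5e:7f:0a:0a and 01:00:5e:7f:0a:0b.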

       

      The system is still running in a failed state, so I can query it for more information.

       

      Thanks,

      Tom.