9 Replies Latest reply: Jul 20, 2012 2:27 PM by Patrick_Kutch

SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational

vshyamk Community Member

We have a strange problem on one of our servers that uses an Intel 82599 SRIOV NIC. The server had been working fine for almost 8 months with the SRIOV PFs/VFs, until suddenly one of the PFs stopped working. We need help determining whether the SRIOV PF has failed in hardware or whether this is a software problem.

 

Running the ethtool offline tests reports PASS, but produces the dmesg entry below:

# ethtool -t eth103 offline
The test result is PASS
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         0
Link test   (on/offline)         0

 

[895552.667586] ixgbe: eth103: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 64 not cleared within the polling period
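For context, the message above comes from the PF driver's queue-disable path (ixgbe_disable_rx_queue): the driver clears the ENABLE bit in the queue's RXDCTL register and then polls a bounded number of times for the hardware to report the bit as cleared. If the bit never clears, the driver gives up and logs this warning. The C sketch below models only that bounded-poll pattern and is not the ixgbe code itself; the register accessors, the sim_reg simulation, disable_rx_queue(), and the assumption that ENABLE is bit 25 (0x02000000) are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror IXGBE_RXDCTL_ENABLE (bit 25) in the ixgbe headers. */
#define RXDCTL_ENABLE 0x02000000u

/* Hypothetical accessors standing in for the driver's register read/write. */
typedef uint32_t (*reg_read_fn)(void *ctx);
typedef void (*reg_write_fn)(void *ctx, uint32_t val);

/* Simulated RXDCTL register: a "wedged" device ignores every write. */
struct sim_reg {
    uint32_t val;
    bool wedged;
};

static uint32_t sim_read(void *ctx) { return ((struct sim_reg *)ctx)->val; }

static void sim_write(void *ctx, uint32_t val)
{
    struct sim_reg *r = ctx;
    if (!r->wedged)
        r->val = val;
}

/*
 * Clear RXDCTL.ENABLE, then poll up to max_polls times for the hardware
 * to report it cleared. Returns true on success, false on timeout.
 */
static bool disable_rx_queue(void *ctx, reg_read_fn rd, reg_write_fn wr,
                             int queue, int max_polls)
{
    assert(max_polls > 0);
    wr(ctx, rd(ctx) & ~RXDCTL_ENABLE);
    for (int i = 0; i < max_polls; i++) {
        /* A real driver would sleep briefly between reads. */
        if (!(rd(ctx) & RXDCTL_ENABLE))
            return true;
    }
    fprintf(stderr, "RXDCTL.ENABLE on Rx queue %d not cleared within the polling period\n",
            queue);
    return false;
}
```

With a healthy simulated register the bit clears on the first poll; with a wedged one every poll still sees ENABLE set, so the function times out and prints the same kind of warning eth103 reported for queue 64.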

 

Also, ethtool --show-ring shows:

# ethtool --show-ring eth103
Ring parameters for eth103:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             64
RX Mini:        0
RX Jumbo:       0
TX:             64

 

The current RX/TX ring size is only 64, whereas it previously showed 512.

 

Some of our VMs have SRIOV VFs from this bad SRIOV PF assigned to them via PCI passthrough, and they run into the same issue. We added some debug prints to the ixgbevf driver and saw that ixgbevf_reset_hw_vf(), which is called at init, fails at

        ret_val = mbx->ops.read_posted(hw, msgbuf, IXGBE_VF_PERMADDR_MSG_LEN);

with the following error:

[    3.484162] ixgbevf: read_posted retval:-100 (IXGBE_ERR_MBX)
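This failure fits a simple model of the mailbox: roughly, the VF posts a reset request and then polls for the PF's reply (which carries the permanent MAC address); if no reply arrives before the timeout, the read fails with IXGBE_ERR_MBX (-100), as in the log above. The C sketch below models only that timeout behaviour with a hypothetical in-memory mailbox; sim_mbx, read_posted(), and the 16-word mailbox size are assumptions for illustration, not the ixgbevf implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ERR_MBX   (-100) /* the value the log reports as IXGBE_ERR_MBX */
#define MBX_WORDS 16     /* assumed mailbox size in 32-bit words */

/* Hypothetical stand-in for the shared VF/PF mailbox memory. */
struct sim_mbx {
    uint32_t buf[MBX_WORDS];
    int pf_msg_pending; /* nonzero once the PF has posted a reply */
};

/*
 * Wait (by polling) for a PF reply, then copy `size` words out of the
 * mailbox. Returns 0 on success or ERR_MBX if `timeout` polls elapse first.
 */
static int32_t read_posted(struct sim_mbx *mbx, uint32_t *msg,
                           uint16_t size, int timeout)
{
    assert(size <= MBX_WORDS);
    while (!mbx->pf_msg_pending) {
        if (timeout-- <= 0)
            return ERR_MBX; /* PF never answered */
        /* A real driver would delay between polls. */
    }
    memcpy(msg, mbx->buf, size * sizeof(uint32_t));
    mbx->pf_msg_pending = 0; /* message consumed */
    return 0;
}
```

In this model a PF that has stopped servicing its mailbox looks exactly like a silent one: every VF reset attempt ends in ERR_MBX, regardless of which VF driver version is asking.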

 

The link status of the SRIOV PF looks fine:

# ip link show dev eth103
5: eth103: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a3:94:39 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 02:17:3e:67:a0:f8
    vf 1 MAC 02:17:3e:45:bf:4a
    vf 2 MAC 02:17:3e:78:d2:d7
    vf 3 MAC 02:17:3e:1a:fb:c6
    vf 4 MAC 02:17:3e:58:35:8d
    vf 5 MAC 02:17:3e:52:ae:4c
    vf 6 MAC 02:17:3e:62:2d:b9
    vf 7 MAC 02:17:3e:24:ae:e3
    vf 8 MAC 02:17:3e:22:35:2b
    vf 9 MAC 02:17:3e:59:86:40
    vf 10 MAC 02:17:3e:6f:9c:de
    vf 11 MAC 02:17:3e:13:0a:c1
    vf 12 MAC 02:17:3e:24:b5:79
    vf 13 MAC 02:17:3e:2d:e1:2a
    vf 14 MAC 02:17:3e:0c:11:df
    vf 15 MAC 02:17:3e:7b:82:d2
    vf 16 MAC 02:17:3e:43:5c:8d
    vf 17 MAC 02:17:3e:54:ed:b2
    vf 18 MAC 02:17:3e:70:8f:53
    vf 19 MAC 02:17:3e:55:8d:2f
    vf 20 MAC 02:17:3e:72:18:20
    vf 21 MAC 02:17:3e:12:ff:95
    vf 22 MAC 02:17:3e:71:d8:4d
    vf 23 MAC 02:17:3e:27:eb:9f
    vf 24 MAC 02:17:3e:29:7a:ad
    vf 25 MAC 02:17:3e:2c:e9:4e
    vf 26 MAC 02:17:3e:15:ce:57
    vf 27 MAC 02:17:3e:6d:61:2c
    vf 28 MAC 02:17:3e:4c:24:4d
    vf 29 MAC 02:17:3e:4c:ab:7e
    vf 30 MAC 1e:f8:b3:79:75:b2
    vf 31 MAC 02:02:2f:eb:73:1e

 

So, essentially, neither the mailbox nor the tx/rx queues appear to work.

 

A dump of all registers from ethtool on this PF can be found here:

https://docs.google.com/document/d/1u-QY4vwpri_l_NZii8mrnfB0bPy1OlprwICj9rM9S_o/edit

 

Our setup:

# The physical servers run Ubuntu Natty (11.04) with Linux kernel 2.6.38-8-server. We are running ixgbe driver 3.2.9, which we compiled locally to disable MAC anti-spoofing (essentially, we always call hw->mac.ops.set_mac_anti_spoofing with the disabled flag). We did this to enable bonding of SRIOV VFs within VMs.

# On the physical server we use ixgbevf 1.0.19-k0 and expose a couple of SRIOV VFs locally for bonding. Primarily, we set up a Linux active-backup bond across SRIOV VFs from two different SRIOV PFs.

# We run several KVM VMs on these servers, running Ubuntu Precise (12.04) with Linux kernel 3.2.0-25-generic and ixgbevf driver version 2.2.0-k. These VMs have SRIOV VFs attached via PCI passthrough, and they in turn set up active-backup bonds across VFs from different SRIOV PFs.

# We set up bonds primarily for failover, while using SRIOV for performance.

 

We don't know whether this problem will go away after a power cycle of the server. We are keeping the server in its current state in case more live state information is needed. Please let us know if any more state information would help isolate this problem.

 

Any help appreciated.

 

Thanks

Shyam


  • 1. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    Patrick_Kutch Community Member

    I will do some digging and see if I can find anything.

     

    Did you happen to have any recent updates applied to your OS?

     

    thanx,

     

    Patrick

  • 2. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    vshyamk Community Member

    Thanks, Patrick. No, we haven't applied any recent updates at the ixgbe driver level. We had some issues moving to 3.2.17 (sometimes SRIOV VFs were spawned without IRQs attached), so we moved back down to 3.2.9, which was stable.

  • 3. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    Patrick_Kutch Community Member

    I was actually thinking along the lines of OS/kernel updates; however, since you haven't rebooted in so long, that is unlikely.

     

    My experts are not sure what the source of your problem is. The error is, as you pointed out, that the mailbox communication stopped working. We can't tell from the description whether it is a software (driver or kernel) problem or a hardware problem.

     

    All we can suggest is to save the kernel and dmesg logs and reboot. If the PF and VFs work after the reboot, we are more inclined to believe it is a software problem of some sort; otherwise, a hardware failure.

     

    Also, before you reboot, please dump the registers with the ethregs tool:

    http://sourceforge.net/projects/e1000/files/Ethregs%20-%20Register%20Dump%20Tool/

     

    If you post it, I'll see whether it provides any more useful information.

     

    Wish I had a magic bullet for you.

     

    thanx,

     

    Patrick

  • 4. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    alex@zadarastorage.com Community Member

    Hi Patrick,

    Please find the output of ethregs here:

    https://docs.google.com/open?id=0ByBy89zr3kJNdEFUMk9SV0dXdjA

     

    I just ran it on eth103 without any options; please let us know if you need us to re-run it with any particular options.

     

    Thanks for your help,

    Alex.

  • 6. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    alex@zadarastorage.com Community Member

    Hi Patrick,

    I think the PF dump is also in the attached file (it contains all the VFs, followed by the PF). Otherwise, please let us know exactly which command you need us to run:

     

    .....

    03:00.0 (8086:10fb)
    Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
        Name                  Value
        ~~~~                  ~~~~~
        CTRL                  00000000
        STATUS                000c8000
        CTRL_EXT              10010000
        ESDP                  00000876
        I2CCTL                0000000f
        FRTIMER               30bef1b6
        TCPTIMER              00000000
        PFVFLRE[1]            00000000
        LEDCTL                45444140
        PFVFLRE[0]            00000000
        PFVFLREC[0]           deadbeef
        PFVFLREC[1]           deadbeef
        PFVFLREC[2]           deadbeef
        PFVFLREC[3]           deadbeef
        PFMBICR[0]            00000000
        PFMBICR[1]            00000000
        PFMBICR[2]            00000000
        PFMBICR[3]            00000000
        PFMBIMR[0]            ffffffff
        PFMBIMR[1]            ffffffff
        PFMBIMR[2]            ffffffff
        PFMBIMR[3]            ffffffff
        EICS                  00000000
        EIAC                  4000ffff
        EITR[000]             000001e8
        EITR[001]             000003d0
        EITR[002]             00000798
        EITR[003]             00000000
        EITR[004]             00000000
        EITR[005]             00000000
        EITR[006]             00000000
        EITR[007]             00000000
        EITR[008]             00000000
        EITR[009]             00000000
        EITR[010]             00000000
        EITR[011]             00000000
        EITR[012]             00000000
        EITR[013]             00000000
        EITR[014]             00000000
        EITR[015]             00000000
        EITR[016]             00000000

    ....

  • 7. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    vshyamk Community Member

    Hi Patrick,

    We rebooted the server, and now the SRIOV PF/VFs are working fine. So it looks like it's a software issue. Could you please check whether the ethregs/ethtool dumps above provide any further information on the issue?

     

    Thanks.

    --Shyam

  • 8. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    vshyamk Community Member

    Hi Patrick,

    One more question on this: is there a version compatibility requirement between ixgbe and ixgbevf?

     

    We are currently on ixgbe 3.2.9, and two versions of ixgbevf are interfacing with the same NIC: ixgbevf 1.0.19-k0 and ixgbevf 2.2.0-k. Would it be an issue to have different versions of ixgbevf working simultaneously on the same NIC?

     

    One thing we observed:

    ixgbevf 2.2.0-k has a new message code, 0x6.

    Although we don't use MAC VLAN, it clears the MAC VLAN filters as follows:

            ixgbevf_set_uc_addr_vf (IXGBE_VF_SET_MACVLAN)

            hw->mac.ops.set_uc_addr(hw, 0, NULL);
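To see why the PF logs exactly 00000006, it helps to look at how such a request word can be laid out: the message code sits in the low bits and the filter index in an info field above it, so clearing index 0 with a NULL address yields a first word of 0x00000006. The C sketch below builds such a request under that assumed layout. build_set_macvlan() is hypothetical, and the 16-bit info shift, 8-bit index width, and 3-word message size are our assumptions about the ixgbe mailbox format rather than values confirmed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 0x06 matches the code in the log; the shift is an assumed layout. */
#define VF_SET_MACVLAN   0x06u
#define VT_MSGINFO_SHIFT 16

/*
 * Hypothetical encoder for a SET_MACVLAN request:
 * word 0 = (index << VT_MSGINFO_SHIFT) | code, words 1-2 = optional MAC.
 * index 0 with addr == NULL is the "clear filters" form quoted above.
 */
static void build_set_macvlan(uint32_t msgbuf[3], uint32_t index,
                              const uint8_t *addr)
{
    assert(index <= 0xFFu); /* assumed 8-bit info field */
    memset(msgbuf, 0, 3 * sizeof(uint32_t));
    msgbuf[0] = (index << VT_MSGINFO_SHIFT) | VF_SET_MACVLAN;
    if (addr != NULL)
        memcpy(&msgbuf[1], addr, 6); /* 6-byte MAC spans words 1 and 2 */
}
```

Under this layout, any index other than 0 would show up in the PF log as a different value (index 1 gives 00010006), so the constant 00000006 is consistent with the VF repeatedly sending only the clear request.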

     

    However, ixgbe 3.2.9 doesn't understand IXGBE_VF_SET_MACVLAN and prints a message like this:

    [1020846.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006

     

    This happens very frequently (for some reason, ixgbevf keeps sending this message roughly every 2 seconds), and ixgbe keeps printing the error.

     

    Could there be other such incompatibilities that might result in this behaviour? Any insights would be appreciated.

     

    Thanks

    Shyam

  • 9. Re: SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational
    Patrick_Kutch Community Member

    We are not sure why your PF seemed to freeze.  We will keep an eye out for such behavior, thanks for bringing it to our attention.

     

    As for your PF/VF alignment: the two drivers are fairly tightly coupled. The VF driver communicates with the PF driver through messages in the mailbox. If one side doesn't understand the other, an error like this will occur.
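The coupling described above can be illustrated with a toy PF-side dispatcher: the PF switches on the message code in the low 16 bits of the first mailbox word, acknowledges the codes it knows, and NACKs (and logs) anything else. A PF driver that predates IXGBE_VF_SET_MACVLAN, as 3.2.9 appears to from the log, would therefore reject every 0x06 request, which matches the repeated "Unhandled Msg 00000006" lines. In the sketch, pf_handle_vf_msg() and the supports_macvlan flag are hypothetical, and the codes 0x02-0x05 and the ACK/NACK bit values are assumed to mirror the ixgbe mailbox header; only 0x06 is confirmed by the log in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror ixgbe_mbx.h; only 0x06 is confirmed by the log above. */
enum {
    VF_SET_MAC_ADDR  = 0x02,
    VF_SET_MULTICAST = 0x03,
    VF_SET_VLAN      = 0x04,
    VF_SET_LPE       = 0x05,
    VF_SET_MACVLAN   = 0x06,
};

#define VT_MSGTYPE_ACK  0x80000000u /* assumed "request handled" flag */
#define VT_MSGTYPE_NACK 0x40000000u /* assumed "request rejected" flag */
#define MSG_CODE_MASK   0xFFFFu     /* message code lives in the low 16 bits */

/*
 * Toy PF handler: returns the reply word for the VF's first mailbox word.
 * supports_macvlan models whether this PF build knows code 0x06.
 */
static uint32_t pf_handle_vf_msg(const uint32_t *msgbuf, bool supports_macvlan)
{
    assert(msgbuf != NULL);
    switch (msgbuf[0] & MSG_CODE_MASK) {
    case VF_SET_MAC_ADDR:
    case VF_SET_MULTICAST:
    case VF_SET_VLAN:
    case VF_SET_LPE:
        /* A real PF would program filters or limits here. */
        return msgbuf[0] | VT_MSGTYPE_ACK;
    case VF_SET_MACVLAN:
        if (supports_macvlan)
            return msgbuf[0] | VT_MSGTYPE_ACK;
        break; /* an older PF treats this code as unknown */
    default:
        break;
    }
    fprintf(stderr, "Unhandled Msg %08x\n", (unsigned)msgbuf[0]);
    return msgbuf[0] | VT_MSGTYPE_NACK;
}
```

This is why mixing driver versions matters: the request itself is well formed, but whether it succeeds depends entirely on which codes the PF build recognizes, so both sides need to be upgraded together.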

     

    I'd recommend updating both the PF and VF drivers to the latest versions available from our SourceForge site:

     

    http://sourceforge.net/projects/e1000/files/

     

    PF Driver - latest ixgbe version is 3.10.16

    VF Driver - latest ixgbevf version is 2.6.2
