1 Reply Latest reply: Apr 20, 2012 8:49 AM by Gary Molenkamp

Kernel Panic when bridging two 10G x520-DAs

Gary Molenkamp Community Member

I have two 10G x520-DA2 NICs (82599EB) running the latest ixgbe driver (3.8.21) under the latest CentOS 6.2 kernel (2.6.32-220.13.1.el6.x86_64).

 

Due to issues with VLAN tagging over the bridged interfaces ( http://communities.intel.com/message/152866 ), I have the bridges configured as:

 

#: brctl show
bridge name    bridge id            STP enabled    interfaces
br0            8000.001b21d73a78    no             eth0
                                                   eth2
br253          8000.001b21d73a78    no             eth0.253
                                                   eth2.253
br353          8000.001b21d73a78    no             eth0.353
                                                   eth2.353
br653          8000.001b21d73a78    no             eth0.653
                                                   eth2.653
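
For anyone comparing this layout against their own setup, here is a minimal sketch that turns `brctl show` output into a bridge-to-interfaces mapping. The function name `parse_brctl_show` and the `SAMPLE` text are hypothetical, made up for illustration; the parsing assumes the usual column layout shown above (name, id, STP flag, first interface, then one interface per continuation line).

```python
def parse_brctl_show(text):
    """Map each bridge name to its list of member interfaces."""
    bridges = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and a leading shell prompt line.
        if not fields or line.startswith("bridge name") or fields[0] == "#:":
            continue
        if len(fields) >= 4:
            # Full row: bridge name, bridge id, STP flag, first interface.
            current = fields[0]
            bridges[current] = [fields[3]]
        elif len(fields) == 3:
            # Bridge with no member interfaces yet.
            current = fields[0]
            bridges[current] = []
        elif len(fields) == 1 and current is not None:
            # Continuation row: one more interface for the current bridge.
            bridges[current].append(fields[0])
    return bridges


# Hypothetical sample, shortened from the listing above.
SAMPLE = """\
bridge name    bridge id            STP enabled    interfaces
br0            8000.001b21d73a78    no             eth0
                                                   eth2
br253          8000.001b21d73a78    no             eth0.253
                                                   eth2.253
"""

if __name__ == "__main__":
    for bridge, members in parse_brctl_show(SAMPLE).items():
        print(bridge, members)
```

With the layout above, each VLAN bridge should list exactly one tagged sub-interface from each physical port.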

 

Iptables and ip6tables are called on the bridge devices:

net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
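
Since the panic path goes through netfilter on the bridge, it may be worth confirming these two values on the running system. A minimal sketch, assuming the settings are exposed under `/proc/sys/net/bridge` (which is where the `net.bridge.*` sysctls normally appear once the bridge code is loaded); the helper name `read_bridge_nf_settings` is hypothetical.

```python
from pathlib import Path


def read_bridge_nf_settings(base="/proc/sys/net/bridge"):
    """Return the bridge netfilter toggles as ints, or None if unreadable."""
    settings = {}
    for name in ("bridge-nf-call-iptables", "bridge-nf-call-ip6tables"):
        path = Path(base) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing (bridge code not loaded) or unexpected content.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for key, value in read_bridge_nf_settings().items():
        print(f"{key} = {value}")
```

A value of 1 means bridged traffic is passed to the corresponding iptables or ip6tables chains, which matches the configuration described here.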

 

About 5-10 seconds after passing traffic through the br253 bridge device, the kernel panics with the following:

 

kernel:general protection fault: 0000 [#1] SMP
kernel:last sysfs file: /sys/devices/virtual/net/br653/bridge/multicast_startup_query_interval

kernel:Stack:

kernel:Call Trace:

kernel:Code: 5f 3a 00 48 8b 05 19 1a e6 00 48 c7 c2 f8 b2 fa 81 48 85 c0 74 26 48 8b 4b 08 48 3b 48 08 77 11 eb 1a 66 0f 1f 84 00 00 00 00 00 <48> 39 48 08 73 0b 48 89

kernel:Kernel panic - not syncing: Fatal exception

 

The console also has an additional line:

stack-protector: Kernel stack is corrupted in: ffffffff8148f073

 

Any help/pointers would be appreciated.

  • 1. Re: Kernel Panic when bridging two 10G x520-DAs
    Gary Molenkamp Community Member

    Using a replayable pcap of live traffic through a 1G-based bridge, I am able to replicate the panic on demand.

     

    Using this pcap, I captured a crash dump of the kernel during the panic:

     

    KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.7.1.el6.x86_64.debug/vmlinux
        DUMPFILE: ./vmcore  [PARTIAL DUMP]
            CPUS: 6
            DATE: Fri Apr 20 06:21:21 2012
          UPTIME: 00:02:37
    LOAD AVERAGE: 0.03, 0.04, 0.01
           TASKS: 179
        NODENAME: test.cluster
         RELEASE: 2.6.32-220.7.1.el6.x86_64.debug
         VERSION: #1 SMP Wed Mar 7 01:52:51 GMT 2012
         MACHINE: x86_64  (3200 Mhz)
          MEMORY: 15.9 GB
           PANIC: "Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff814bd0c3"
             PID: 0
         COMMAND: "swapper"
            TASK: ffffffff81821020  (1 of 6)  [THREAD_INFO: ffffffff81794000]
             CPU: 0
           STATE: TASK_RUNNING (PANIC)

     

    From the log dump:

     

    th2: no IPv6 routers present
    eth0: no IPv6 routers present
    eth4: no IPv6 routers present
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff814bd0c3

     

    Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64.debug #1
    Call Trace:
    <IRQ>  [<ffffffff8151cd80>] ? panic+0x78/0x148
    [<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
    [<ffffffff8106e2fb>] ? __stack_chk_fail+0x1b/0x30
    [<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
    [<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
    [<ffffffff81012d29>] ? sched_clock+0x9/0x10
    [<ffffffff8109e1b5>] ? sched_clock_local+0x25/0x90
    [<ffffffff8109e2d8>] ? sched_clock_cpu+0xb8/0x110
    [<ffffffffa04b90f3>] ? ipt_hook+0x23/0x30 [iptable_filter]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
    [<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
    [<ffffffff8109e407>] ? cpu_clock+0x57/0x80
    [<ffffffffa03bf750>] ? __br_forward+0x0/0xc0 [bridge]
    [<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
    [<ffffffffa03bf5c1>] ? br_flood+0xc1/0xd0 [bridge]
    [<ffffffffa03bf5e5>] ? br_flood_forward+0x15/0x20 [bridge]
    [<ffffffffa03c087e>] ? br_handle_frame_finish+0x27e/0x2a0 [bridge]
    [<ffffffffa03c5820>] ? nf_bridge_alloc+0x30/0xc0 [bridge]
    [<ffffffffa03c64a8>] ? br_nf_pre_routing_finish+0x228/0x340 [bridge]
    [<ffffffffa03c6a1f>] ? br_nf_pre_routing+0x45f/0x760 [bridge]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
    [<ffffffffa03c0a2c>] ? br_handle_frame+0x18c/0x250 [bridge]
    [<ffffffff8145bd29>] ? __netif_receive_skb+0x569/0x740
    [<ffffffff8145b8f0>] ? __netif_receive_skb+0x130/0x740
    [<ffffffff8145aae6>] ? get_rps_cpu+0x126/0x3b0
    [<ffffffff8145a9c0>] ? get_rps_cpu+0x0/0x3b0
    [<ffffffff8145c058>] ? netif_receive_skb+0x58/0x60
    [<ffffffff8145c160>] ? napi_skb_finish+0x50/0x70
    [<ffffffff814f73f4>] ? vlan_gro_receive+0x84/0xa0
    [<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
    [<ffffffff8145c9f8>] ? net_rx_action+0x188/0x3a0
    [<ffffffff8145c970>] ? net_rx_action+0x100/0x3a0
    [<ffffffff81076a0d>] ? __do_softirq+0xdd/0x200
    [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
    [<ffffffff8100dfdd>] ? do_softirq+0xad/0xe0
    [<ffffffff810765f5>] ? irq_exit+0x95/0xa0
    [<ffffffff81526de5>] ? do_IRQ+0x75/0xf0
    [<ffffffff8100ba93>] ? ret_from_intr+0x0/0x16
    <EOI>  [<ffffffff812ecf18>] ? intel_idle+0xe8/0x170
    [<ffffffff812ecf11>] ? intel_idle+0xe1/0x170
    [<ffffffff81524540>] ? __atomic_notifier_call_chain+0x0/0xa0
    [<ffffffff814253c7>] ? cpuidle_idle_call+0xa7/0x150
    [<ffffffff81009e0b>] ? cpu_idle+0xbb/0x110
    [<ffffffff81503f6a>] ? rest_init+0x7a/0x80
    [<ffffffff81b89fa7>] ? start_kernel+0x456/0x462
    [<ffffffff81b8933a>] ? x86_64_start_reservations+0x125/0x129
    [<ffffffff81b89438>] ? x86_64_start_kernel+0xfa/0x109

     

    =================================
    [ INFO: inconsistent lock state ]
    2.6.32-220.7.1.el6.x86_64.debug #1
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/0 [HC0[0]:SC1[2]:HE0:SE0] takes:
    (pgd_lock){+.?...}, at: [<ffffffff81043990>] vmalloc_sync_all+0x80/0x170
    {SOFTIRQ-ON-W} state was registered at:
      [<ffffffff810afa8c>] __lock_acquire+0x63c/0x1570
      [<ffffffff810b0a64>] lock_acquire+0xa4/0x120
      [<ffffffff81520bf6>] _spin_lock+0x36/0x70
      [<ffffffff81044bdc>] __change_page_attr_set_clr+0x1ac/0xbd0
      [<ffffffff8104573e>] change_page_attr_set_clr+0x13e/0x530
      [<ffffffff8104607f>] _set_memory_wb+0x2f/0x40
      [<ffffffff810445a7>] ioremap_change_attr+0x17/0x40
      [<ffffffff81046d76>] kernel_map_sync_memtype+0x86/0xf0
      [<ffffffff810441f2>] __ioremap_caller+0x292/0x3c0
      [<ffffffff81044414>] ioremap_cache+0x14/0x20
      [<ffffffff8150887a>] acpi_os_map_memory+0x17/0x20
      [<ffffffff8130ef5e>] acpi_tb_verify_table+0x2e/0x5c
      [<ffffffff8130e763>] acpi_load_tables+0x3e/0x133
      [<ffffffff81bbe15e>] acpi_early_init+0x60/0xf5
      [<ffffffff81b89f98>] start_kernel+0x447/0x462
      [<ffffffff81b8933a>] x86_64_start_reservations+0x125/0x129
      [<ffffffff81b89438>] x86_64_start_kernel+0xfa/0x109
    irq event stamp: 529583
    hardirqs last  enabled at (529582): [<ffffffff815209c0>] _spin_unlock_irqrestore+0x40/0x80
    hardirqs last disabled at (529583): [<ffffffff81034390>] native_machine_crash_shutdown+0x40/0x210
    softirqs last  enabled at (529484): [<ffffffff81076a7a>] __do_softirq+0x14a/0x200
    softirqs last disabled at (529489): [<ffffffff8100c30c>] call_softirq+0x1c/0x30

     

    other info that might help us debug this:
    7 locks held by swapper/0:
    #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8145c970>] net_rx_action+0x100/0x3a0
    #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8145b8f0>] __netif_receive_skb+0x130/0x740
    #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
    #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
    #4:  (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
    #5:  (&lock->lock){+.-...}, at: [<ffffffffa04b0f62>] ipt_do_table+0x112/0x678 [ip_tables]
    #6:  (kexec_mutex){+.+.+.}, at: [<ffffffff810cad27>] crash_kexec+0x27/0x110

     

    stack backtrace:
    Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64.debug #1
    Call Trace:
    <IRQ>  [<ffffffff810ad947>] ? print_usage_bug+0x177/0x180
    [<ffffffff810ae8ed>] ? mark_lock+0x35d/0x430
    [<ffffffff810afa2a>] ? __lock_acquire+0x5da/0x1570
    [<ffffffff810199bd>] ? save_stack_trace+0x2d/0x50
    [<ffffffff8129610c>] ? put_dec+0x10c/0x110
    [<ffffffff810b0a64>] ? lock_acquire+0xa4/0x120
    [<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
    [<ffffffff81520bf6>] ? _spin_lock+0x36/0x70
    [<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
    [<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
    [<ffffffff8109d796>] ? register_die_notifier+0x16/0x30
    [<ffffffff810341f0>] ? kdump_nmi_callback+0x0/0x160
    [<ffffffff81029f99>] ? nmi_shootdown_cpus+0x59/0xc0
    [<ffffffff810343a6>] ? native_machine_crash_shutdown+0x56/0x210
    [<ffffffff810ca93b>] ? append_elf_note+0x8b/0xb0
    [<ffffffff81029e0f>] ? machine_crash_shutdown+0xf/0x20
    [<ffffffff810cad66>] ? crash_kexec+0x66/0x110
    [<ffffffff8100f495>] ? show_trace+0x15/0x20
    [<ffffffff810cadff>] ? crash_kexec+0xff/0x110
    [<ffffffff8151cd87>] ? panic+0x7f/0x148
    [<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
    [<ffffffff8106e2fb>] ? __stack_chk_fail+0x1b/0x30
    [<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
    [<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
    [<ffffffff81012d29>] ? sched_clock+0x9/0x10
    [<ffffffff8109e1b5>] ? sched_clock_local+0x25/0x90
    [<ffffffff8109e2d8>] ? sched_clock_cpu+0xb8/0x110
    [<ffffffffa04b90f3>] ? ipt_hook+0x23/0x30 [iptable_filter]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
    [<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
    [<ffffffff8109e407>] ? cpu_clock+0x57/0x80
    [<ffffffffa03bf750>] ? __br_forward+0x0/0xc0 [bridge]
    [<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
    [<ffffffffa03bf5c1>] ? br_flood+0xc1/0xd0 [bridge]
    [<ffffffffa03bf5e5>] ? br_flood_forward+0x15/0x20 [bridge]
    [<ffffffffa03c087e>] ? br_handle_frame_finish+0x27e/0x2a0 [bridge]
    [<ffffffffa03c5820>] ? nf_bridge_alloc+0x30/0xc0 [bridge]
    [<ffffffffa03c64a8>] ? br_nf_pre_routing_finish+0x228/0x340 [bridge]
    [<ffffffffa03c6a1f>] ? br_nf_pre_routing+0x45f/0x760 [bridge]
    [<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
    [<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
    [<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
    [<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
    [<ffffffffa03c0a2c>] ? br_handle_frame+0x18c/0x250 [bridge]
    [<ffffffff8145bd29>] ? __netif_receive_skb+0x569/0x740
    [<ffffffff8145b8f0>] ? __netif_receive_skb+0x130/0x740
    [<ffffffff8145aae6>] ? get_rps_cpu+0x126/0x3b0
    [<ffffffff8145a9c0>] ? get_rps_cpu+0x0/0x3b0
    [<ffffffff8145c058>] ? netif_receive_skb+0x58/0x60
    [<ffffffff8145c160>] ? napi_skb_finish+0x50/0x70
    [<ffffffff814f73f4>] ? vlan_gro_receive+0x84/0xa0
    [<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
    [<ffffffff8145c9f8>] ? net_rx_action+0x188/0x3a0
    [<ffffffff8145c970>] ? net_rx_action+0x100/0x3a0
    [<ffffffff81076a0d>] ? __do_softirq+0xdd/0x200
    [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
    [<ffffffff8100dfdd>] ? do_softirq+0xad/0xe0
    [<ffffffff810765f5>] ? irq_exit+0x95/0xa0
    [<ffffffff81526de5>] ? do_IRQ+0x75/0xf0
    [<ffffffff8100ba93>] ? ret_from_intr+0x0/0x16
    <EOI>  [<ffffffff812ecf18>] ? intel_idle+0xe8/0x170
    [<ffffffff812ecf11>] ? intel_idle+0xe1/0x170
    [<ffffffff81524540>] ? __atomic_notifier_call_chain+0x0/0xa0
    [<ffffffff814253c7>] ? cpuidle_idle_call+0xa7/0x150
    [<ffffffff81009e0b>] ? cpu_idle+0xbb/0x110
    [<ffffffff81503f6a>] ? rest_init+0x7a/0x80
    [<ffffffff81b89fa7>] ? start_kernel+0x456/0x462
    [<ffffffff81b8933a>] ? x86_64_start_reservations+0x125/0x129
    [<ffffffff81b89438>] ? x86_64_start_kernel+0xfa/0x109

     

    Does this look like a driver, bridge or netfilter issue?

     

    Using the same packet capture trace, I reran the test against the kernel's ixgbe driver (3.4.8-k) and the system remains stable.
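
On the driver/bridge/netfilter question, one rough way to summarize a trace like the one above is to count frames per kernel module. This is only a sketch: the names `FRAME_RE`, `frames_by_module`, and `SAMPLE_TRACE` are hypothetical, frames without a `[module]` suffix are lumped together as "vmlinux", and frames marked with `?` are not guaranteed to be real callers, so the counts are a hint and not a diagnosis.

```python
import re
from collections import Counter

# Matches frames such as:
#   [<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+\??\s*(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def frames_by_module(trace):
    """Count call-trace frames per module; built-in symbols count as 'vmlinux'."""
    counts = Counter()
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            counts[match.group(2) or "vmlinux"] += 1
    return counts


# A few frames copied from the trace above.
SAMPLE_TRACE = """\
[<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
[<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
[<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
[<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
[<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
"""

if __name__ == "__main__":
    for module, count in frames_by_module(SAMPLE_TRACE).most_common():
        print(module, count)
```

Feeding in the full trace shows at a glance how much of the path sits in bridge and netfilter code versus ixgbe, which can help decide where to report the problem.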
