I have two 10G X520-DA2 NICs (82599EB) running the latest ixgbe driver (3.8.21) under the latest CentOS 6.2 kernel (2.6.32-220.13.1.el6.x86_64).
Because of issues with VLAN tagging over the bridged interfaces (http://communities.intel.com/message/152866), I have the bridges configured as follows:
# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.001b21d73a78       no              eth0
                                                        eth2
br253           8000.001b21d73a78       no              eth0.253
                                                        eth2.253
br353           8000.001b21d73a78       no              eth0.353
                                                        eth2.353
br653           8000.001b21d73a78       no              eth0.653
                                                        eth2.653
Iptables and ip6tables are called on the bridge devices:
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
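For anyone trying to reproduce this, the two settings above can be confirmed on a running system by reading the matching entries under /proc/sys/net/bridge. The following is only a minimal sketch: the function name `read_bridge_nf` and the `base` parameter are made up for illustration, and the entries exist only while the bridge module is loaded.

```python
from pathlib import Path

# The two bridge-netfilter switches quoted above, as named under /proc/sys.
BRIDGE_NF_KEYS = (
    "bridge-nf-call-iptables",
    "bridge-nf-call-ip6tables",
)


def read_bridge_nf(base="/proc/sys/net/bridge"):
    """Return {key: int or None}; None means the entry could not be read."""
    values = {}
    for key in BRIDGE_NF_KEYS:
        path = Path(base) / key
        try:
            values[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing file (bridge module not loaded) or unexpected content.
            values[key] = None
    return values


if __name__ == "__main__":
    for key, value in read_bridge_nf().items():
        print(f"net.bridge.{key} = {value}")
```

A value of 1 means bridged frames are passed to the corresponding iptables or ip6tables chains, which is the path that shows up in the call trace further down.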
About 5-10 seconds after passing traffic through the br253 bridge device, the kernel panics with the following:
kernel:general protection fault: 0000 [#1] SMP
kernel:last sysfs file: /sys/devices/virtual/net/br653/bridge/multicast_startup_query_interval
kernel:Stack:
kernel:Call Trace:
kernel:Code: 5f 3a 00 48 8b 05 19 1a e6 00 48 c7 c2 f8 b2 fa 81 48 85 c0 74 26 48 8b 4b 08 48 3b 48 08 77 11 eb 1a 66 0f 1f 84 00 00 00 00 00 <48> 39 48 08 73 0b 48 89
kernel:Kernel panic - not syncing: Fatal exception
The console also has an additional line:
stack-protector: Kernel stack is corrupted in: ffffffff8148f073
Any help/pointers would be appreciated.
Using a replayable pcap of live traffic through a 1G-based bridge, I am able to reproduce the panic on demand.
Using this pcap, I captured a crash dump of the kernel during the panic:
KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.7.1.el6.x86_64.debug/vmlinux
DUMPFILE: ./vmcore [PARTIAL DUMP]
CPUS: 6
DATE: Fri Apr 20 06:21:21 2012
UPTIME: 00:02:37
LOAD AVERAGE: 0.03, 0.04, 0.01
TASKS: 179
NODENAME: test.cluster
RELEASE: 2.6.32-220.7.1.el6.x86_64.debug
VERSION: #1 SMP Wed Mar 7 01:52:51 GMT 2012
MACHINE: x86_64 (3200 Mhz)
MEMORY: 15.9 GB
PANIC: "Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff814bd0c3"
PID: 0
COMMAND: "swapper"
TASK: ffffffff81821020 (1 of 6) [THREAD_INFO: ffffffff81794000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
From the log dump:
th2: no IPv6 routers present
eth0: no IPv6 routers present
eth4: no IPv6 routers present
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff814bd0c3
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64.debug #1
Call Trace:
<IRQ> [<ffffffff8151cd80>] ? panic+0x78/0x148
[<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
[<ffffffff8106e2fb>] ? __stack_chk_fail+0x1b/0x30
[<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
[<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
[<ffffffff81012d29>] ? sched_clock+0x9/0x10
[<ffffffff8109e1b5>] ? sched_clock_local+0x25/0x90
[<ffffffff8109e2d8>] ? sched_clock_cpu+0xb8/0x110
[<ffffffffa04b90f3>] ? ipt_hook+0x23/0x30 [iptable_filter]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
[<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
[<ffffffff8109e407>] ? cpu_clock+0x57/0x80
[<ffffffffa03bf750>] ? __br_forward+0x0/0xc0 [bridge]
[<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
[<ffffffffa03bf5c1>] ? br_flood+0xc1/0xd0 [bridge]
[<ffffffffa03bf5e5>] ? br_flood_forward+0x15/0x20 [bridge]
[<ffffffffa03c087e>] ? br_handle_frame_finish+0x27e/0x2a0 [bridge]
[<ffffffffa03c5820>] ? nf_bridge_alloc+0x30/0xc0 [bridge]
[<ffffffffa03c64a8>] ? br_nf_pre_routing_finish+0x228/0x340 [bridge]
[<ffffffffa03c6a1f>] ? br_nf_pre_routing+0x45f/0x760 [bridge]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
[<ffffffffa03c0a2c>] ? br_handle_frame+0x18c/0x250 [bridge]
[<ffffffff8145bd29>] ? __netif_receive_skb+0x569/0x740
[<ffffffff8145b8f0>] ? __netif_receive_skb+0x130/0x740
[<ffffffff8145aae6>] ? get_rps_cpu+0x126/0x3b0
[<ffffffff8145a9c0>] ? get_rps_cpu+0x0/0x3b0
[<ffffffff8145c058>] ? netif_receive_skb+0x58/0x60
[<ffffffff8145c160>] ? napi_skb_finish+0x50/0x70
[<ffffffff814f73f4>] ? vlan_gro_receive+0x84/0xa0
[<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
[<ffffffff8145c9f8>] ? net_rx_action+0x188/0x3a0
[<ffffffff8145c970>] ? net_rx_action+0x100/0x3a0
[<ffffffff81076a0d>] ? __do_softirq+0xdd/0x200
[<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
[<ffffffff8100dfdd>] ? do_softirq+0xad/0xe0
[<ffffffff810765f5>] ? irq_exit+0x95/0xa0
[<ffffffff81526de5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba93>] ? ret_from_intr+0x0/0x16
<EOI> [<ffffffff812ecf18>] ? intel_idle+0xe8/0x170
[<ffffffff812ecf11>] ? intel_idle+0xe1/0x170
[<ffffffff81524540>] ? __atomic_notifier_call_chain+0x0/0xa0
[<ffffffff814253c7>] ? cpuidle_idle_call+0xa7/0x150
[<ffffffff81009e0b>] ? cpu_idle+0xbb/0x110
[<ffffffff81503f6a>] ? rest_init+0x7a/0x80
[<ffffffff81b89fa7>] ? start_kernel+0x456/0x462
[<ffffffff81b8933a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81b89438>] ? x86_64_start_kernel+0xfa/0x109
=================================
[ INFO: inconsistent lock state ]
2.6.32-220.7.1.el6.x86_64.debug #1
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0 [HC0[0]:SC1[2]:HE0:SE0] takes:
(pgd_lock){+.?...}, at: [<ffffffff81043990>] vmalloc_sync_all+0x80/0x170
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810afa8c>] __lock_acquire+0x63c/0x1570
[<ffffffff810b0a64>] lock_acquire+0xa4/0x120
[<ffffffff81520bf6>] _spin_lock+0x36/0x70
[<ffffffff81044bdc>] __change_page_attr_set_clr+0x1ac/0xbd0
[<ffffffff8104573e>] change_page_attr_set_clr+0x13e/0x530
[<ffffffff8104607f>] _set_memory_wb+0x2f/0x40
[<ffffffff810445a7>] ioremap_change_attr+0x17/0x40
[<ffffffff81046d76>] kernel_map_sync_memtype+0x86/0xf0
[<ffffffff810441f2>] __ioremap_caller+0x292/0x3c0
[<ffffffff81044414>] ioremap_cache+0x14/0x20
[<ffffffff8150887a>] acpi_os_map_memory+0x17/0x20
[<ffffffff8130ef5e>] acpi_tb_verify_table+0x2e/0x5c
[<ffffffff8130e763>] acpi_load_tables+0x3e/0x133
[<ffffffff81bbe15e>] acpi_early_init+0x60/0xf5
[<ffffffff81b89f98>] start_kernel+0x447/0x462
[<ffffffff81b8933a>] x86_64_start_reservations+0x125/0x129
[<ffffffff81b89438>] x86_64_start_kernel+0xfa/0x109
irq event stamp: 529583
hardirqs last enabled at (529582): [<ffffffff815209c0>] _spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (529583): [<ffffffff81034390>] native_machine_crash_shutdown+0x40/0x210
softirqs last enabled at (529484): [<ffffffff81076a7a>] __do_softirq+0x14a/0x200
softirqs last disabled at (529489): [<ffffffff8100c30c>] call_softirq+0x1c/0x30
other info that might help us debug this:
7 locks held by swapper/0:
#0: (rcu_read_lock){.+.+..}, at: [<ffffffff8145c970>] net_rx_action+0x100/0x3a0
#1: (rcu_read_lock){.+.+..}, at: [<ffffffff8145b8f0>] __netif_receive_skb+0x130/0x740
#2: (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
#3: (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
#4: (rcu_read_lock){.+.+..}, at: [<ffffffff81484b20>] nf_hook_slow+0x0/0x140
#5: (&lock->lock){+.-...}, at: [<ffffffffa04b0f62>] ipt_do_table+0x112/0x678 [ip_tables]
#6: (kexec_mutex){+.+.+.}, at: [<ffffffff810cad27>] crash_kexec+0x27/0x110
stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64.debug #1
Call Trace:
<IRQ> [<ffffffff810ad947>] ? print_usage_bug+0x177/0x180
[<ffffffff810ae8ed>] ? mark_lock+0x35d/0x430
[<ffffffff810afa2a>] ? __lock_acquire+0x5da/0x1570
[<ffffffff810199bd>] ? save_stack_trace+0x2d/0x50
[<ffffffff8129610c>] ? put_dec+0x10c/0x110
[<ffffffff810b0a64>] ? lock_acquire+0xa4/0x120
[<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
[<ffffffff81520bf6>] ? _spin_lock+0x36/0x70
[<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
[<ffffffff81043990>] ? vmalloc_sync_all+0x80/0x170
[<ffffffff8109d796>] ? register_die_notifier+0x16/0x30
[<ffffffff810341f0>] ? kdump_nmi_callback+0x0/0x160
[<ffffffff81029f99>] ? nmi_shootdown_cpus+0x59/0xc0
[<ffffffff810343a6>] ? native_machine_crash_shutdown+0x56/0x210
[<ffffffff810ca93b>] ? append_elf_note+0x8b/0xb0
[<ffffffff81029e0f>] ? machine_crash_shutdown+0xf/0x20
[<ffffffff810cad66>] ? crash_kexec+0x66/0x110
[<ffffffff8100f495>] ? show_trace+0x15/0x20
[<ffffffff810cadff>] ? crash_kexec+0xff/0x110
[<ffffffff8151cd87>] ? panic+0x7f/0x148
[<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
[<ffffffff8106e2fb>] ? __stack_chk_fail+0x1b/0x30
[<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
[<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
[<ffffffff81012d29>] ? sched_clock+0x9/0x10
[<ffffffff8109e1b5>] ? sched_clock_local+0x25/0x90
[<ffffffff8109e2d8>] ? sched_clock_cpu+0xb8/0x110
[<ffffffffa04b90f3>] ? ipt_hook+0x23/0x30 [iptable_filter]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03c5a30>] ? br_nf_forward_finish+0x0/0x140 [bridge]
[<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03bf6f0>] ? br_forward_finish+0x0/0x60 [bridge]
[<ffffffff8109e407>] ? cpu_clock+0x57/0x80
[<ffffffffa03bf750>] ? __br_forward+0x0/0xc0 [bridge]
[<ffffffffa03bf7c2>] ? __br_forward+0x72/0xc0 [bridge]
[<ffffffffa03bf5c1>] ? br_flood+0xc1/0xd0 [bridge]
[<ffffffffa03bf5e5>] ? br_flood_forward+0x15/0x20 [bridge]
[<ffffffffa03c087e>] ? br_handle_frame_finish+0x27e/0x2a0 [bridge]
[<ffffffffa03c5820>] ? nf_bridge_alloc+0x30/0xc0 [bridge]
[<ffffffffa03c64a8>] ? br_nf_pre_routing_finish+0x228/0x340 [bridge]
[<ffffffffa03c6a1f>] ? br_nf_pre_routing+0x45f/0x760 [bridge]
[<ffffffff814846e9>] ? nf_iterate+0x69/0xb0
[<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
[<ffffffff81484bc4>] ? nf_hook_slow+0xa4/0x140
[<ffffffffa03c0600>] ? br_handle_frame_finish+0x0/0x2a0 [bridge]
[<ffffffffa03c0a2c>] ? br_handle_frame+0x18c/0x250 [bridge]
[<ffffffff8145bd29>] ? __netif_receive_skb+0x569/0x740
[<ffffffff8145b8f0>] ? __netif_receive_skb+0x130/0x740
[<ffffffff8145aae6>] ? get_rps_cpu+0x126/0x3b0
[<ffffffff8145a9c0>] ? get_rps_cpu+0x0/0x3b0
[<ffffffff8145c058>] ? netif_receive_skb+0x58/0x60
[<ffffffff8145c160>] ? napi_skb_finish+0x50/0x70
[<ffffffff814f73f4>] ? vlan_gro_receive+0x84/0xa0
[<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
[<ffffffff8145c9f8>] ? net_rx_action+0x188/0x3a0
[<ffffffff8145c970>] ? net_rx_action+0x100/0x3a0
[<ffffffff81076a0d>] ? __do_softirq+0xdd/0x200
[<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
[<ffffffff8100dfdd>] ? do_softirq+0xad/0xe0
[<ffffffff810765f5>] ? irq_exit+0x95/0xa0
[<ffffffff81526de5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba93>] ? ret_from_intr+0x0/0x16
<EOI> [<ffffffff812ecf18>] ? intel_idle+0xe8/0x170
[<ffffffff812ecf11>] ? intel_idle+0xe1/0x170
[<ffffffff81524540>] ? __atomic_notifier_call_chain+0x0/0xa0
[<ffffffff814253c7>] ? cpuidle_idle_call+0xa7/0x150
[<ffffffff81009e0b>] ? cpu_idle+0xbb/0x110
[<ffffffff81503f6a>] ? rest_init+0x7a/0x80
[<ffffffff81b89fa7>] ? start_kernel+0x456/0x462
[<ffffffff81b8933a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81b89438>] ? x86_64_start_kernel+0xfa/0x109
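To get a rough picture of which subsystems appear in a trace like the one above, the frames can be grouped by module, and the address from the stack-protector message can be matched against them. This is only a sketch under assumptions: the helper names (`parse_frames`, `frames_by_module`, `symbol_at`) are hypothetical, the regular expression covers only the frame format shown in this dump, and `SAMPLE` is a handful of lines copied from the trace. Frames marked with `?` are not guaranteed to be part of the real call chain, so the counts are a hint and not a diagnosis.

```python
import re
from collections import Counter

# Matches frames such as:
#   [<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\][ \t]+(?:\?[ \t]+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:[ \t]+\[(?P<mod>\w+)\])?"
)


def parse_frames(text):
    """Extract address, symbol, offset and module from each trace frame."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "addr": m.group("addr"),
            "symbol": m.group("sym"),
            "offset": int(m.group("off"), 16),
            # Frames without a [module] suffix belong to the core kernel image.
            "module": m.group("mod") or "vmlinux",
        })
    return frames


def frames_by_module(frames):
    """Count how many frames each module contributes."""
    return Counter(f["module"] for f in frames)


def symbol_at(frames, addr):
    """Return the symbol of the first frame at the given address, if any."""
    for f in frames:
        if f["addr"] == addr.lower():
            return f["symbol"]
    return None


SAMPLE = """\
 [<ffffffff814bd0c3>] ? icmp_send+0x743/0x780
 [<ffffffff8106e2fb>] ? __stack_chk_fail+0x1b/0x30
 [<ffffffffa04b121b>] ? ipt_do_table+0x3cb/0x678 [ip_tables]
 [<ffffffffa03c6f0e>] ? br_nf_forward_ip+0x1ee/0x3c0 [bridge]
 [<ffffffffa02fc593>] ? ixgbe_poll+0xd43/0x1410 [ixgbe]
"""

if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    print(frames_by_module(frames))
    # Address taken from the "Kernel stack is corrupted in:" message.
    print(symbol_at(frames, "ffffffff814bd0c3"))
```

In this dump the address in the stack-protector message, ffffffff814bd0c3, is the same address as the icmp_send+0x743 frame, which is reached from ipt_do_table on the bridge forward path.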
Does this look like a driver, bridge or netfilter issue?
Using the same packet capture, I reran the test with the kernel's bundled ixgbe driver (3.4.8-k), and the system remained stable.

