I've been working with VMware support for some time now, trying to make progress on a PSOD we had on an ESX 4.1 host. VMware's fault team / engineering can't say exactly what went wrong, but they claim it was ixgbe-related. The case has been open for almost two months now, and I'm hoping I can get someone from Intel to help escalate this issue on the Intel side.
Here's the information they've provided me from the PSOD analysis:
From the backtrace:
3:04:01:55.202 cpu6:4330)0x417f80757eb0:[0x418019792d6c]ixgbe_configure@esx:nover+0x73 stack: 0x41000fe86270, 0x41000fe86270
3:04:01:55.203 cpu6:4330)0x417f80757ed0:[0x418019794195]ixgbe_up@esx:nover+0x10 stack: 0x41000fe86270, 0x0, 0x417f80757f90,
3:04:01:55.204 cpu6:4330)0x417f80757ef0:[0x4180197942c6]ixgbe_reinit_locked@esx:nover+0x8d stack: 0xa5cf6c000001006, 0xfe823
3:04:01:55.205 cpu6:4330)0x417f80757f90:[0x41801966c1c0]vmklnx_workqueue_callout@esx:nover+0x11b stack: 0x417f80757ff0, 0x41
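For anyone comparing their own PSOD against this one, the backtrace above can be boiled down to a call chain. This is only a rough sketch in Python, assuming the frame format seen in the four lines above; the regular expression and the names FRAME_RE and parse_frames are made up for illustration and are not part of any VMware or Intel tool.

```python
import re

# Assumed frame format: "[0xADDR]function@module+0xOFFSET"
FRAME_RE = re.compile(r"\[(0x[0-9a-f]+)\](\w+)@(\S+?)\+(0x[0-9a-f]+)")


def parse_frames(text):
    """Return (function, offset) pairs for every backtrace frame found in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            _addr, func, _module, offset = match.groups()
            frames.append((func, offset))
    return frames
```

The frames appear to be listed innermost first (ixgbe_configure at the top), so reversing the result should give the order in which the functions were called.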
From the logs prior to the crash:
3:04:01:55.082 cpu14:4169)<3>ixgbe: ixgbe_netqueue_ops: Unhandled NETQUEUE OP 16
3:04:01:55.083 cpu14:4169)NetPort: 2232: resuming traffic on DV port 1757
3:04:01:55.083 cpu14:4169)NetPort: 982: enabled port 0x4000005 with mac 00:00:00:00:00:00
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 1
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 2
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 3
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 4
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 5
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 6
3:04:01:55.083 cpu14:4169)<3>ixgbe: vmnic7: ixgbe_alloc_tx_queue: allocated tx queue 7
I've highlighted the vmkernel log entry that looks suspicious to VMware support (the "Unhandled NETQUEUE OP 16" line), and I agree with them. I get the other messages about tx queue allocation from ixgbe a lot, so I think they are fine.
We do not have any ixgbe module options passed to the driver. I'll post back with the specific driver version that was running, as I don't have that detail in front of me right now.
If I should be posting this elsewhere please let me know.
You are following the best option by working with VMware support on the issue. I am sorry to hear that a solution is taking a long time. If anyone in the communities is aware of a workaround or other solution, your best place for finding that information would be in the VMware forums at http://communities.vmware.com/index.jspa.
I also passed on the information you posted to the Intel Ethernet virtualization software team. Unfortunately, there is no solution available at this time that we know about.
Thanks for your response. The VMware escalation engineer was actually the person who suggested that I reach out to the Intel support community to try to get some additional response on the problem, so I don't feel that posting on the VMware community site would be of as much help. It might be informative for other ESX users with ixgbe cards who upgrade to 4.1 and hit the same problem, on the off chance they search there, though I'd imagine they'd search kb.vmware.com or open a support request.
I'll keep hoping that someone will find this post and engage with VMware support on this so we can get the problem put to bed and the case closed.
Has anyone already experimented with v3.4.23 from http://downloads.vmware.com/d/details/dt_esxi4x_intel_10g_82598/ZHcqYnQld2hiZGhwZA== ?
Since we haven't heard from VMware that this fixes the issue, I wouldn't try it myself, especially since we weren't able to reproduce the PSOD when running load tests in development.
Apparently VMware is in touch with Intel, and we were able to deliver several PSODs to them, but no one seems able to find an issue.
We have also asked our hardware manufacturer, Fujitsu, but have no definitive answer yet.
Here is an interesting side note that is not related to the PSOD but does explain the log message:
3:04:01:55.082 cpu14:4169)<3>ixgbe: ixgbe_netqueue_ops: Unhandled NETQUEUE OP 16
It turns out that this message is harmless. The developer who wrote this code came to me today to explain the message. I cannot improve on the explanation from the developer, so here is what he said:
This [message] comes from the ixgbe driver built with the ESX 4.0 API tools being run in ESX 4.1, which has a slightly larger API. Per VMware specifications, the ESX 4.0 drivers can be used safely on ESX 4.1, which is the case for ixgbe.
The driver is complaining about a request from the ESX 4.1 core that it doesn’t know how to handle. This particular request is simply ESX trying to ask the driver if certain advanced features are supported. The 4.0 driver doesn’t know about the 4.1 request, so it spits out a little complaint message before returning a “fail” status to ESX; the fail status lets ESX know that the advanced features are not supported.
This does not explain the PSOD, but at least we now know that you can safely ignore this log message.
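To make the developer's explanation a bit more concrete, here is a minimal sketch of the pattern he describes: a dispatcher that logs a complaint and returns a failure status for any request it does not recognize. It is written in Python for brevity and is not the actual ixgbe code. Only op code 16 and the message text come from the log above; the function signature, the status values, and the handler table are assumptions made up for illustration.

```python
# Illustrative status values; the real driver's return codes are not shown in this thread.
OP_SUCCESS = 0
OP_FAILURE = -1


def netqueue_ops(op, handlers, log=print):
    """Dispatch a request code to its handler, or log a complaint and fail."""
    handler = handlers.get(op)
    if handler is None:
        # Unknown request (for example one added in a newer API): complain,
        # then report failure so the caller treats the feature as unsupported.
        log(f"ixgbe: ixgbe_netqueue_ops: Unhandled NETQUEUE OP {op}")
        return OP_FAILURE
    return handler()


# Hypothetical table of requests that an older driver build knows about.
OLD_DRIVER_HANDLERS = {
    1: lambda: OP_SUCCESS,
    2: lambda: OP_SUCCESS,
}
```

Calling netqueue_ops(16, OLD_DRIVER_HANDLERS) in this sketch logs the complaint and returns OP_FAILURE, which matches the behavior described: the caller simply learns that the feature is not supported.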