We have several 82546EB (dual-port gigabit PCI NIC) cards that we use in Linux "router" boxes. One of our links is a 10 Mbit/sec metropolitan Ethernet connection to a datacenter. The carrier presents us with an RJ-45 Ethernet jack manually set to 10/full, so we set the link to 10/full on the Linux side to match. This has worked without problem for some time. During a recent unplanned reboot of the router, the link appeared to go dead. Since it was during the day, there was a relatively high volume of traffic flowing (or trying to flow) over the link. After a lot of headache we determined that the link would stop responding if (and only if) there was "a lot" of traffic during the first minute or so after the link was brought up. If the link saw only minimal traffic during the first five minutes (being conservative) after coming up, it would stay up indefinitely.
Other than traffic not flowing, not much happens to indicate anything is wrong. The switch side does not log any events. The OS side sometimes logs one or more link down/up event pairs, but the final state is always "up" whether or not the link is actually working.
To recover, the link just needs to be brought down and back up while packet volume is low. An ifconfig down/up cycle is sufficient; no physical link changes or rebooting are required. Occasionally the link will recover on its own after 3-4 minutes of inactivity (and numerous link down/up messages), but this is not typical.
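For reference, the recovery cycle is nothing more than the following sketch. eth2 is a placeholder interface name, and the run() wrapper only echoes each command so the sketch is harmless as written; drop the echo to execute for real (root required).

```shell
# Placeholder interface name -- substitute your own.
IFACE=eth2

# run() only echoes the commands; remove the echo to execute them.
run() { echo "+ $*"; }

run ifconfig "$IFACE" down
run ifconfig "$IFACE" up
# Keep traffic light for a few minutes after the link comes up,
# then resume normal load.
```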
The issue is present with the Linux e1000 driver under FC6, CentOS 5.3 and CentOS 5.4. It is ALSO present on FreeBSD 8.0-RC2 (using the em driver). We have reproduced the issue with two different managed switches and two completely different computers (moving one of the cards between them).
The issue can be reproduced fairly easily doing something like this:
1. Set the Ethernet port at the far end (a managed switch in our case) to 10/full.
2. Set the e1000/em interface to 10/full and assign IPs/routes as appropriate.
3. Watch the system log if desired (in its own terminal): tail -F /var/log/messages
4. Start a flood ping (preferably in its own terminal), e.g. ping -nf somehost
5. Observe minimal (FreeBSD) or no (Linux) packet loss = normal behavior.
6. With the flood ping still running, take down the interface (e.g. ifdown eth2 or ifconfig em0 down).
7. Observe packet loss, "no route to host" errors, etc. on the flood ping.
8. Bring the interface back up (e.g. ifup eth2 or ifconfig em0 up).
9. Watch the ping again. Packets start flowing shortly after the link comes up but then usually stop after several seconds.
10. Continue to watch if desired. Most of the time nothing else happens (it stays broken). Sometimes it will go through several cycles of what appear to be adapter resets (link up/down messages, with packets getting through for a second or two after each reset). Sometimes when this happens the adapter will recover (packets get through after a reset and it doesn't die again).
11. To recover, kill the flood ping, take the interface down, bring it back up, observe that (light) traffic flows, wait a couple of minutes, then resume the flood ping.
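The steps above can be condensed into a sketch like this. eth2 and somehost are placeholders, and the run() wrapper only echoes each command rather than executing it; substitute your own names and remove the echo to run it for real (root required).

```shell
# Placeholders -- substitute your own interface and a host that is
# reachable across the 10/full link.
IFACE=eth2
TARGET=somehost

# run() only echoes the commands; remove the echo to execute them.
run() { echo "+ $*"; }

# Force 10 Mbit full duplex with autonegotiation off, matching the
# far end (Linux; on FreeBSD the equivalent would be something like
# 'ifconfig em0 media 10baseT/UTP mediaopt full-duplex').
run ethtool -s "$IFACE" autoneg off speed 10 duplex full

# These run in their own terminals:
#   tail -F /var/log/messages
#   ping -nf $TARGET

# Bounce the interface while the flood ping is running.
run ifconfig "$IFACE" down
run ifconfig "$IFACE" up

# Expected: packets flow briefly after the link comes up, then stop.
```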
Since this behavior doesn't seem to be specific to any one NIC, driver, OS, or computer/chipset, my suspicion is that it's something in the NIC hardware or firmware. We have only seen this behavior with manually-set links at 10/full; autonegotiated links and manual 100/full links work fine. The number of people running gigabit cards at 10 Mb/s is understandably small, so this may simply not have been noticed before. We haven't (yet) tested any Intel gigabit cards other than the 82546EB units we have.
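For comparison, here are the three link configurations in ethtool terms (Linux side; eth2 is a placeholder and the run() wrapper only echoes the commands). Only the first one exhibits the hang for us:

```shell
IFACE=eth2              # placeholder interface name
run() { echo "+ $*"; }  # echoes only; drop the echo to execute (root)

# Broken for us: forced 10/full
run ethtool -s "$IFACE" autoneg off speed 10 duplex full

# Works fine: forced 100/full
run ethtool -s "$IFACE" autoneg off speed 100 duplex full

# Works fine: autonegotiation (the far end must autonegotiate too)
run ethtool -s "$IFACE" autoneg on
```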
We will be using other cards (probably PRO/100) as a workaround, but I'm wondering if Intel is interested in analyzing and potentially fixing this issue. Aside from being broken and annoying, it's a potential vector for a DoS attack. If anyone else has seen this or has any suggestions (or wild speculation) that would be welcome as well.
If I can/should provide any additional information, just let me know what's needed. Thanks!