The first one is connected to a 10/100 embedded device, the second one to a PC with a 1 Gb NIC, and the last one to a GigE camera. There are no switches in between. The problem occurs in a similar way on four identical systems, so I believe we can rule out a failure of any individual piece of hardware.
The operating system in question is Debian Squeeze, running Linux kernel 2.6.32-5-amd64 by default. It comes with an old e1000e driver module.
Most of the time, the connections work fine. However, any of the three links may go down after a seemingly random amount of time for no apparent reason. This is what I see in the system logs:
Aug 17 22:08:56 XXX kernel: [22144.179804] e1000e: eth0 NIC Link is Down
Aug 17 22:08:57 XXX kernel: [22145.806317] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Sometimes only one of the interfaces is affected (in this case eth0), sometimes two, sometimes all three. The order in which the links go down follows no pattern. The link recovers after the failure, but the interruption causes the camera to lose some data and drop its connection.
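To check whether the outages really follow no pattern, the Down/Up events can be paired up per interface and their durations compared. The following is only a minimal sketch that assumes the log format shown above; the function name `link_outages` and the `sample` list are illustrative, and in practice the lines would come from the system log file.

```python
import re
from collections import defaultdict

# Matches kernel log lines such as:
# Aug 17 22:08:56 XXX kernel: [22144.179804] e1000e: eth0 NIC Link is Down
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+e1000e: (?P<iface>\S+) NIC Link is (?P<state>Up|Down)"
)

def link_outages(lines):
    """Return {iface: [(down_ts, up_ts), ...]} using the kernel uptime timestamps."""
    pending = {}                  # iface -> timestamp of the last unmatched "Down"
    outages = defaultdict(list)
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue              # not a link state message
        ts = float(m.group("ts"))
        iface = m.group("iface")
        if m.group("state") == "Down":
            pending[iface] = ts
        elif iface in pending:    # "Up" that closes a previously seen "Down"
            outages[iface].append((pending.pop(iface), ts))
    return dict(outages)

# The two log lines quoted in the post, used here as sample input.
sample = [
    "Aug 17 22:08:56 XXX kernel: [22144.179804] e1000e: eth0 NIC Link is Down",
    "Aug 17 22:08:57 XXX kernel: [22145.806317] e1000e: eth0 NIC Link is Up "
    "100 Mbps Full Duplex, Flow Control: Rx/Tx",
]

for iface, spans in link_outages(sample).items():
    for down, up in spans:
        print(f"{iface}: link down at {down:.3f} s for {up - down:.3f} s")
```

For the two lines quoted above, this reports a single eth0 outage of roughly 1.6 seconds. Run over a full log, the same pairing would show whether the interfaces drop together or independently.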
I updated the e1000e driver to the latest one found on Intel's site (188.8.131.52). I used "make CFLAGS_EXTRA=-DDISABLE_PM install" to disable power management because this was suggested by some people who had had problems with the driver. The new driver loaded fine but didn't solve the problem.
I tried setting InterruptThrottleRate=3000,3000,3000 as a module parameter. No luck. Setting the rate to 10000, however, seemed to increase the frequency of the problem.
Then, I updated the Linux kernel to 3.2.0 and recompiled the driver as well. That didn't help either, and now I'm out of ideas. Is there anything else I could try out?
It seems to have been the new kernel rather than the InterruptThrottleRate setting that made the problem more frequent. Before the kernel update, the problem occurred about every 8 hours on average. After the update, the interval dropped to about half an hour.
Another thing I forgot to mention is that I have disabled both pcie_aspm and acpi at boot. Since PM is also disabled in the driver at compile time, it is unlikely that this is a power management issue.
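One way to confirm that the boot options actually took effect is to inspect the command line the running kernel was started with. This is a rough sketch only: it assumes the options were passed as `pcie_aspm=off` and `acpi=off` (the exact spelling is not given above), and the helper names `boot_flags` and `pm_disabled` are made up for illustration.

```python
def boot_flags(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    flags = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        flags[name] = value if sep else None
    return flags

def pm_disabled(cmdline):
    """Report whether each option appears with the value 'off' (assumed spelling)."""
    flags = boot_flags(cmdline)
    return {name: flags.get(name) == "off" for name in ("pcie_aspm", "acpi")}

if __name__ == "__main__":
    try:
        # /proc/cmdline holds the parameters passed to the running kernel.
        with open("/proc/cmdline") as f:
            print(pm_disabled(f.read()))
    except OSError:
        print("cannot read /proc/cmdline")
```

If either entry comes back as False, the option did not reach the kernel in that form, and the bootloader configuration would be worth rechecking.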