"0146 - PCI out of resource" blocks POST of our S2600IP-based system

CColg · ‎05-02-2013

We're trying to use up to three DSP cards from Advantech (http://www.advantech.com/products/DSPC-8681/mod_A9314996-0022-4927-9AA4-6B1060D4E5E8.aspx) in this system. Ideally we will be able to load all of the add-on slots, but we're nowhere near that level yet. Each card alone in any x8 or x16 slot passes POST, but no seating combination of two or more cards passes.

Have any of you ever encountered a problem like this with an S2600? I have followed the generic remedies for the problem on the Intel site (http://www.intel.com/support/motherboards/server/sb/CS-034070.htm). I have also seen discussions here regarding multiple NICs, but not arbitrary add-ons and certainly not these DSP cards.

Steps I've taken so far:

Updated BIOS/firmware
Disabled unused devices and OpROMs in BIOS
Made the following changes to the BIOS (BIOS Setup => Advanced => PCI Configuration):
- Set Maximize Memory below 4 GB to Disabled
- Set Memory Mapped I/O above 4 GB to Enabled
- Selected MMIO size of 256G
Tried every combination of x8 and x16 seats for two cards
Collected System Event log using Intel utility
- Added nothing to output already available in the UI
Collected System Info using Intel utility
Engaged Advantech with questions regarding this use of their board
- They feel it is a mother-/main-board issue and won't investigate further

I'm relatively sure this is a BIOS/firmware issue but have tried every reasonable configuration point in the UI. It is hard to imagine that we really are exhausting PCI resources with a grand total of 8 DSP chips in the system.

Thanks for any suggestions,

Chad

DSilv11 · ‎05-02-2013

Hmm, not a simple card, and this is very likely to get very deep real quick.

I could start by blaming the card, but 1) that is rude and 2) why would anyone do that without supporting data?

Selected MMIO size of 256G -- Take it all the way to max. (1024G)

May not help, but should not hurt.

You already said you disabled the on-board devices. Does that include the NICs and HDD controllers? At least as a test.

The next step would require a PCIe config space dump with the multiple cards installed. (Though you may want to do this first with no cards installed, since I have a feeling it is going to be big.)

The best way to do this is to insert a USB key to collect the data,

Then at the shell prompt:

SHELL> map -r

(whole bunch of drive info dumps out)

The USB fob will likely be fs0: or fs1:, so type that:

shell> FS0:

FS0:>

Then

FS0:> PCI > PCIDUMP.txt

Makes a file listing ALL the pci devices in the system (long list and much longer if you have the cards installed)

From this list you can see whether the board and BIOS are seeing the cards.

The next step would be to look at each individual card's configuration in PCI config space. Use the list from the first dump to identify the Bus, Device, and Function of the card, then type:

FS0:>PCI xx xx xx -i >> pcidump.txt

where the xx's are the Bus, Device, and Function numbers.

By reviewing the dump from all the devices on the cards you should be able to glean some information such as how much MMIO it really wants or if the cards are throwing configuration errors.
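As background on where a device's MMIO request comes from, here is a minimal Python sketch of the standard PCI BAR sizing probe: firmware writes all ones to a Base Address Register and reads the value back, and the size follows from the address bits the device left at zero. This is not taken from the dumps in this thread. The function name and the sample read-back values are hypothetical, and it only handles a 32-bit memory BAR (or the low half of a 64-bit BAR whose size is under 4 GB).

```python
def memory_bar_size(readback):
    """Return the size in bytes requested by a 32-bit memory BAR.

    `readback` is the value read from the BAR after writing 0xFFFFFFFF to it.
    Bits 3:0 of a memory BAR are type/prefetch flags, not address bits.
    """
    if readback & 0x1:
        raise ValueError("bit 0 set: this is an I/O BAR, not a memory BAR")
    address_bits = readback & 0xFFFFFFF0
    if address_bits == 0:
        return 0  # BAR not implemented
    # The size is the two's complement of the writable address mask.
    return ((~address_bits) & 0xFFFFFFFF) + 1


# Hypothetical read-back values, for illustration only.
size_a = memory_bar_size(0xF0000000)  # 256 MB window
size_b = memory_bar_size(0xF800000C)  # 128 MB, 64-bit prefetchable flags set
size_c = memory_bar_size(0xFFFFF000)  # 4 KB window
```

Adding up the sizes of every BAR on every device behind a card gives a rough lower bound on what that card asks for; alignment requirements usually push the real figure higher.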

I saw a comment on the card vendor's site about the cards taking a long time to come up before you can configure the PCI space, which these dumps would also show.

Hope that helps and did not get too deep.


CColg · ‎05-03-2013

Thanks for not punting on the DSP card. It may well turn out to be a card issue but the vendor is not convinced based on my description to this point. I need a more definitive way to either implicate or rule out their card(s).

To that end, I tried each of your suggestions. First, I took the largest MMIO space setting available from the BIOS UI menu. The POST still failed after that change so I moved on to the UEFI shell.

The 'pci' output did not contain any indication of either card when two were seated. A pastebin of the output is here: http://pastebin.com/X8n1uHVG. I removed one card and called pci again. The resulting output can be found here: http://pastebin.com/CzqvFr7u. I notice that even the bridge device is missing when both cards are seated.

I went on to run 'pci -i' on the devices in the single card configuration. The dump from the first of four TI devices is here: http://pastebin.com/5jhFzVwc. A dump of the first of five PLX bridge devices is here: http://pastebin.com/GB4ufTmd. I have the dumps for the rest of the TI and PLX devices. How would I go about using this information to determine MMIO allocation requirements?

I would think that if this were a problem related to a prolonged secondary boot cycle on either board then it would also be evident in the single card configuration. It's unclear to me how seating two cards instead of one would complicate/elongate the boot cycle of either one. The only thing that comes to mind is power limitations whereby each card competes for total available energy. That theory goes against my understanding of the standard though where each slot should provide up to 75W of independent power. Besides, I don't see any indication of a black-out or current overage in the BMC logs.

Thanks again for your feedback to this case. I've learned a bit about the UEFI CLI along the way and hopefully am one step closer to cracking this beast.

DSilv11 · ‎05-04-2013

You can see why I thought this could get very deep very fast (and why some support lines want to just blame the other guy). I should put in the standard disclaimer that "you should always select devices off the Tested Hardware and OS List since they are tested and supported by Intel". (End of speech, since I don't think Intel has any cards similar to these on the THOL.)

It may take me a bit to weed through these dumps, and then I am going to be asking for a few more.

Some background

Every time you boot the system, one of the early POST (power on self-test) functions is to figure out the PCI bus.

The processor sends a query. Many devices within the processor respond, as well as the chipset and any devices installed in the riser slots. As the processor finds these devices, it assigns a Bus and Device number and reads the Function numbers from the device. It then proceeds to negotiate with the device to establish the bus width (x1, x2, x4, x8, x16), which generation of PCIe the device and the motherboard can support in common, the memory or interrupts needed, etc.

It also reads setup info that tells the processor whether additional devices may be behind the first device, and then goes looking for those additional devices.
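The "goes looking for additional devices" step is normally a depth-first walk. Here is a minimal Python sketch of how bus numbers fall out of that walk, assuming a made-up tree of bridges; the dictionary layout, the function name, and the example topology are all hypothetical, and real firmware does much more (it may also reserve spare bus numbers, for example).

```python
def number_buses(bridges, primary):
    """Assign bus numbers depth-first to the bridges found on bus `primary`.

    Each bridge is a dict that may hold a "children" list of bridges sitting
    on its secondary bus. Returns the highest bus number in use.
    """
    highest = primary
    for bridge in bridges:
        bridge["primary"] = primary
        bridge["secondary"] = highest + 1
        # Everything found below this bridge gets numbered before we move on,
        # so the subordinate number is the last bus reached underneath it.
        bridge["subordinate"] = number_buses(
            bridge.get("children", []), bridge["secondary"]
        )
        highest = bridge["subordinate"]
    return highest


# Hypothetical topology: one root port feeding a switch with two downstream ports.
down_a = {"name": "downstream port A"}
down_b = {"name": "downstream port B"}
upstream = {"name": "switch upstream port", "children": [down_a, down_b]}
root_port = {"name": "root port", "children": [upstream]}

last_bus = number_buses([root_port], primary=0)
```

In this example the root port ends up with primary 0, secondary 1, subordinate 4, which is the same Primary/Secondary/Subordinate triple that a 'pci -i' dump prints for a bridge.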

In the event something goes wrong, error information can get logged in a couple places.

The first and easiest to read is the System Event Log (SEL) (oops - fat fingered the keyboard so now I am editing)

The SEL can be read using the SELVIEW tool found here:

https://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=21003&lang=eng&OSVersion=&DownloadType=

In the SEL you are looking for anything related to PCI: things like PCIe link down, malformed TTLP, malformed DTLP, or PCI FAT error (FAT = fatal error; no idea who abbreviated that one). If you see anything like these, save both the text version and the HEX version, as they can offer additional clues.

The second place which should contain more information is in the PCI error registers of the upstream device. (The downstream device being your cards that are not showing up)

The dump you posted with one card installed was very helpful since it identifies the upstream device for us. (Assuming you left that card plugged into the same slot)

Message was edited by: Doc_SilverCreek (Must be getting late. I keep hitting the wrong keys.)

One of the items the PCI -i dump gives us is the sub-bus structure.
In the case of the PLX bridge at address 07 04 00, it shows me that bus 8 is below this device
and that this is the downstream port which talks to the devices on the card.

43. (Bus Numbers) Primary(18) Secondary(19) Subordinate(1A)
44. ------------------------------------------------------
45. 07 08 08

77. Device/PortType(7:4): Downstream Port of PCI Express Switch

All well and good, but for the failing condition we need to:

1) Identify the upstream bus.
2) Save the PCI -i info from the upstream controller with 1 card and with 2 cards installed.

HINT: The upstream controller will have Intel vendor ID 8086 and will be a bridge device (PCI/PCI bridge or HOST/PCI).

I was looking for a picture to illustrate and found this link, which on a quick read looks pretty good: http://www.tldp.org/LDP/tlk/dd/pci.html

Once we have figured out how the bus structure is being set up, as in Figure 6.7 (Configuring a PCI System: Part 2), we can then start looking at the error registers in the correct motherboard controller to see why it failed to find the cards.

In the configuration you posted, I am looking for the device represented by the ??, which is identified by having a secondary bus of 07:

(Bus Numbers) Primary(18) Secondary(19) Subordinate(1A)
------------------------------------------------------
?? 07 xx

It is much easier to do this than to explain it, so sorry if I am unclear.
I usually just script it in EFI, dump everything, then search the results for the data I want.
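If the dumps are copied off the USB key, the "search the results" step can be done off-line. Here is a minimal Python sketch, assuming the bus-number block looks like the three lines quoted above (with or without the pastebin line numbers). The function names are hypothetical, and the "00 02 00" entry in the example is invented purely for illustration; only the 07 04 00 values come from this thread.

```python
import re

# Three two-digit hex values, optionally prefixed by a pastebin line number.
BUS_ROW = re.compile(
    r"\s*(?:\d+\.\s*)?([0-9A-Fa-f]{2})\s+([0-9A-Fa-f]{2})\s+([0-9A-Fa-f]{2})\s*"
)


def parse_bus_numbers(dump_text):
    """Return (primary, secondary, subordinate) from one 'pci -i' dump, or None."""
    lines = dump_text.splitlines()
    for i, line in enumerate(lines):
        if "(Bus Numbers)" in line:
            # The values follow the header, after a row of dashes.
            for candidate in lines[i + 1:i + 4]:
                match = BUS_ROW.fullmatch(candidate)
                if match:
                    return tuple(int(group, 16) for group in match.groups())
    return None


def find_upstream(bridges, target_bus):
    """Return the addresses of bridges whose secondary bus equals target_bus."""
    return [bdf for bdf, (_, secondary, _) in bridges.items()
            if secondary == target_bus]


plx_dump = """43.(Bus Numbers) Primary(18) Secondary(19) Subordinate(1A)
44. ------------------------------------------------------
45. 07 08 08
"""
bridges = {
    "07 04 00": parse_bus_numbers(plx_dump),  # from the dump quoted above
    "00 02 00": (0x00, 0x07, 0x0C),           # invented entry, illustration only
}
upstream = find_upstream(bridges, 0x07)
```

With real data, the `bridges` dictionary would hold one entry per bridge dump, and the address returned for bus 07 is the controller whose error registers are worth reading next.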

CColg · ‎05-07-2013

Thanks again for the excellent advice. There is plenty for me to chew on there and more from the card vendor.

I did collect the SEL for the two card case, but in text only. The log is here: http://pastebin.com/uaKzAszR. I don't see any indication of TTLP or DTLP errors. It only seems to tell me what I already knew about resource exhaustion.

Hopefully your procedure will reveal more information. I will see about crafting an EFI script to get it done.

DSilv11 · ‎05-07-2013

Here is a script I use when trying to collect the full PCI space dump. (I am not a programmer.)

echo -off

Echo This PCI space dump will take a bit.

Echo Full dump completed time ~16hr --- Limited dump time ~2 hrs

date

time

copy log.txt old.txt

date > log.txt

time >> log.txt

pci >> log.txt

:loop

# Bus v1 v2 256 buses

for %a run (0 15)

set v1 %a

if %a == 10 then

set v1 A

endif

if %a == 11 then

set v1 B

endif

if %a == 12 then

set v1 C

endif

if %a == 13 then

set v1 D

endif

if %a == 14 then

set v1 E

endif

if %a == 15 then

set v1 F

endif

for %b run (0 15)

set v2 %b

if %b == 10 then

set v2 A

endif

if %b == 11 then

set v2 B

endif

if %b == 12 then

set v2 C

endif

if %b == 13 then

set v2 D

endif

if %b == 14 then

set v2 E

endif

if %b == 15 then

set v2 F

endif

# Device v3 v4: 32 devices by spec; can be reduced to increase speed.

# I know of no items with more than 10 devices

# set loop variable C to 0 1 for all possible devices

# set loop variable C to 0 0 for 16 devices

# set loop variable C to 0 0 and D to 0 10 for 10 devices

for %c run (0 0)

set v3 %c

if %c == 10 then

set v3 A

endif

if %c == 11 then

set v3 B

endif

if %c == 12 then

set v3 C

endif

if %c == 13 then

set v3 D

endif

if %c == 14 then

set v3 E

endif

if %c == 15 then

set v3 F

endif

for %d run (0 10)

set v4 %d

if %d == 10 then

set v4 A

endif

if %d == 11 then

set v4 B

endif

if %d == 12 then

set v4 C

endif

if %d == 13 then

set v4 D

endif

if %d == 14 then

set v4 E

endif

if %d == 15 then

set v4 F

endif

# function V5 v6 8 functions max per specification

for %e run (0 0)

set v5 %e

if %e == 10 then

set v5 A

endif

if %e == 11 then

set v5 B

endif

if %e == 12 then

set v5 C

endif

if %e == 13 then

set v5 D

endif

if %e == 14 then

set v5 E

endif

if %e == 15 then

set v5 F

endif

for %f run (0 7)

set v6 %f

if %f == 10 then

set v6 A

endif

if %f == 11 then

set v6 B

endif

if %f == 12 then

set v6 C

endif

if %f == 13 then

set v6 D

endif

if %f == 14 then

set v6 E

endif

if %f == 15 then

set v6 F

endif

echo Bus %v1%%v2% Device %v3%%v4% Function %v5%%v6%

pci %v1%%v2% %v3%%v4% %v5%%v6% -i >> log.txt

endfor

endfo...

CColg · ‎05-10-2013

Hi,

Thanks very much for your script. The card vendor is going to update their firmware which I will retest. I'm sure your script will come in handy if the failures persist.

Their fix will reduce the per-card memory request from 512MB to 128MB over 5 windows. While a total of 512MB per card seems like a lot, the allocation failure is still surprising to me given the maximum MMIO setting you suggested earlier in the case. Shouldn't that have provided a huge I/O space?

Thanks again,

Chad

DSilv11 · ‎05-10-2013

I was reading the card spec today (which did not help much) but got to wondering about the embedded NIC.

The thought I had was that the dual NICs may be loading boot option ROMs. (There is not a lot of option ROM space, since the ROMs have to load their hooks in lower memory.)

If this is the case, you can disable the PCI OpROMs in BIOS setup, but it is a little tricky.

With one card installed you need to:

1) Boot and go into BIOS setup.
2) Select Advanced - PCI Configuration - PCI OpROM.
3) Disable the option ROM.
4) Power down.
5) Install the next card.
6) Repeat from step 1 until all cards are installed and their OpROMs disabled.

The OpROM disable tab is kind of a catch-22.

The tab only shows up if BIOS detects and configures a card containing an option ROM during POST, but if too much OpROM space is used, BIOS can't configure the card, so it never shows up and hence you can't disable it either.

May not be the issue, but I figured it is worth a try.