1 of 1 people found this helpful
Hmm, not a simple card, and very likely to get very deep real quick.
I could start by blaming the card, but 1) that is rude and 2) why would anyone do that without supporting data?
- Selected MMIO size of 256G -- take it all the way to the max (1024G).
May not help, but should not hurt.
You already said you disabled the on-board devices. NICs and HDD controllers? At least as a test.
Next step would require a PCIe config space dump with the multiple cards installed. (Though you may want to do this first with no cards installed, since I have a feeling it is going to be big.)
The best way to do this is to insert a USB key to collect the data,
Then at the shell prompt:
SHELL> map -r
(whole bunch of drive info dumps out)
The USB fob will likely be fs0: or fs1:, so type that:
FS0:> PCI > PCIDUMP.txt
Makes a file listing ALL the PCI devices in the system (long list, and much longer if you have the cards installed).
From this list you can see if the board & BIOS are seeing the cards.
Next step would be to look at the individual card's configuration in PCI config space. Look in the list from the first dump to identify the Bus Device Function of the card, then type:
FS0:>PCI xx xx xx -i >> pcidump.txt
where the xx's are the Bus Device Function.
By reviewing the dump from all the devices on the cards you should be able to glean some information such as how much MMIO it really wants or if the cards are throwing configuration errors.
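As background on the "how much MMIO it really wants" part: a memory BAR advertises its size by which low bits stay zero after firmware writes all-ones to it. Here is a rough Python sketch of that decoding (the register values are invented for illustration, not taken from these cards):

```python
# Sketch of how firmware sizes a 32-bit memory BAR, which is where the
# "how much MMIO it really wants" number comes from: write all-ones to
# the BAR, read it back, and the size is implied by the lowest bit that
# could be set. Register values below are invented for illustration.

def bar_size(readback: int) -> int:
    """MMIO size implied by a 32-bit memory BAR after an all-ones write."""
    masked = readback & 0xFFFFFFF0          # drop the memory-type bits
    return (~masked + 1) & 0xFFFFFFFF       # lowest settable bit = size

def bar_flags(bar: int) -> dict:
    """Decode the type bits in the low nibble of a memory BAR."""
    return {
        "io": bool(bar & 0x1),
        "64bit": ((bar >> 1) & 0x3) == 0x2,
        "prefetchable": bool(bar & 0x8),
    }

# A BAR that reads back 0xF8000000 wants 128 MB, aligned to 128 MB.
print(bar_size(0xF8000000) // (1 << 20), "MB")
```

Summing the sizes of all the memory BARs in the dump gives the card's total MMIO appetite.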
I saw a comment on the card vendors site about the cards taking a long time to come up before you can configure the PCI space which these dumps would show also.
Hope that helps and did not get too deep.
Thanks for not punting on the DSP card. It may well turn out to be a card issue but the vendor is not convinced based on my description to this point. I need a more definitive way to either implicate or rule out their card(s).
To that end, I tried each of your suggestions. First, I took the largest MMIO space setting available from the BIOS UI menu. The POST still failed after that change so I moved on to the UEFI shell.
The 'pci' output did not contain any indication of either card when two were seated. A pastebin of the output is here. I removed one card and called pci again. The resulting output can be found here. I notice that even the bridge device is missing when both cards are seated.
I went on to run 'pci -i' on the devices in the single card configuration. The dump from the first of four TI devices is here. A dump of the first of five PLX bridge devices is here. I have the dumps for the rest of the TI and PLX devices. How would I go about using this information to determine MMIO allocation requirements?
I would think that if this were a problem related to a prolonged secondary boot cycle on either board then it would also be evident in the single card configuration. It's unclear to me how seating two cards instead of one would complicate/elongate the boot cycle of either one. The only thing that comes to mind is power limitations whereby each card competes for total available energy. That theory goes against my understanding of the standard though where each slot should provide up to 75W of independent power. Besides, I don't see any indication of a black-out or current overage in the BMC logs.
Thanks again for your feedback to this case. I've learned a bit about the UEFI CLI along the way and hopefully am one step closer to cracking this beast.
You can see why I thought this could get very deep very fast (and why some support lines want to just blame the other guy). I should put in the standard disclaimer that “you should always select devices off the Tested Hardware and OS List since they are tested and supported by Intel”. (End of speech, since I don’t think Intel has any cards similar to these on the THOL.)
It may take me a bit to weed through these dumps, and then I am going to be asking for a few more.
Every time you boot the system, one of the early POST (power on self-test) functions is to figure out the PCI bus.
The processor sends a query. Many devices within the processor respond, as well as the chipset and any devices installed in the riser slots. As the processor finds these devices it assigns a Bus and Device number and reads the Function numbers from the device. It then proceeds to negotiate with the device to establish bus width (x1, x2, x4, x8, x16), which generation of PCIe the device and the motherboard can support in common, memory or interrupts needed, etc.
It also reads setup info that tells the processor if additional devices may be behind the first device, and then goes looking for the additional devices.
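If it helps to see that walk concretely, here is a toy sketch of the depth-first bus numbering (Python, with an invented device tree; real firmware does this through config-space registers, not objects):

```python
# Toy model of the depth-first PCI bus scan described above: each bridge
# gets a fresh secondary bus number when found, and its subordinate number
# covers everything discovered behind it. The device tree is invented.

class Dev:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # non-empty children => a bridge

def scan(bus_devices, bus_num, next_bus, assignments):
    """Assign bus numbers depth-first; return the highest bus behind us."""
    subordinate = bus_num
    for dev in bus_devices:
        if dev.children:                 # a bridge: give it a new bus
            secondary = next_bus[0]
            next_bus[0] += 1
            sub = scan(dev.children, secondary, next_bus, assignments)
            assignments[dev.name] = (bus_num, secondary, sub)
            subordinate = max(subordinate, sub)
    return subordinate

# Root bus 0 with one PLX upstream bridge fanning out to two downstream ports.
tree = [Dev("plx-upstream", [Dev("plx-down-A", [Dev("endpoint-1")]),
                             Dev("plx-down-B", [Dev("endpoint-2")])])]
assignments = {}
scan(tree, 0, [1], assignments)
for name, (pri, sec, sub) in assignments.items():
    print(f"{name}: primary={pri} secondary={sec} subordinate={sub}")
```

The (primary, secondary, subordinate) triples printed here are exactly the Bus Numbers fields the `pci -i` dump shows for each bridge.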
In the event something goes wrong, error information can get logged in a couple places.
The first and easiest to read is the System Event Log (SEL) (oops - fat fingered the keyboard so now I am editing)
The SEL can be read using the SELVIEW tool found here:
In the SEL you are looking for anything related to PCI: things like PCIe link down, malformed TLP, or PCI FAT error (no idea who abbreviated that one; FAT = fatal error). If you see anything like these, save both the text version and the HEX version as they can offer additional clues.
The second place which should contain more information is in the PCI error registers of the upstream device. (The downstream device being your cards that are not showing up)
The dump you posted with one card installed was very helpful since it identifies the upstream device for us. (Assuming you left that card plugged into the same slot)
One of the items the PCI -i dump gives us is the sub-bus structure. In the case of the PLX bridge at address 07 04 00 it shows me that bus 8 is below this device, and that this is the downstream port which talks to the devices on the card:

(Bus Numbers) Primary(18) Secondary(19) Subordinate(1A)
------------------------------------------------------
                  07          08            08
Device/PortType(7:4): Downstream Port of PCI Express Switch

All well and good, but for the failing condition we need to 1) identify the upstream bus and 2) save the PCI -i info from the upstream controller with one card and with two cards installed.
HINT: The upstream controller will be Intel vendor ID 8086 and a bridge device - PCI/PCI bridge or HOST/PCI.
I was looking for a picture to illustrate and found this link, which on a quick read looks pretty good: http://www.tldp.org/LDP/tlk/dd/pci.html
Once we have figured out how the bus structure is being set up, as in Figure 6.7: Configuring a PCI System: Part 2, we can then start looking at the error registers in the correct motherboard controller to see why it failed to find the cards.
In the configuration you posted I am looking for the device represented by the ??, which is identified by having 07 as its secondary bus:

(Bus Numbers) Primary(18) Secondary(19) Subordinate(1A)
------------------------------------------------------
                  ??          07            xx

It is much easier to do this than to explain it, so sorry if I am unclear. I usually just script it in EFI and dump everything, then search the results for the data I want.
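Once you have a big dump file on the USB key, the hunt for the bridge whose secondary bus is 07 can be scripted too. A rough Python sketch, with the line layout assumed from the excerpt above (adjust the offsets if your dump formats differently):

```python
# Sketch: scan a saved "pci -i" dump for bridge Bus Numbers entries and
# pick out the one whose secondary bus is 07. The layout is assumed from
# the excerpt quoted above: the three hex values appear two lines below
# the "Primary(18) Secondary(19) Subordinate(1A)" header.
import re

def find_bridges(dump_text):
    """Return (primary, secondary, subordinate) triples found in the dump."""
    triples = []
    lines = dump_text.splitlines()
    for i, line in enumerate(lines):
        if "Primary(18)" in line and "Secondary(19)" in line:
            if i + 2 < len(lines):
                hexes = re.findall(r"[0-9A-Fa-f]{2}", lines[i + 2])
                if len(hexes) >= 3:
                    triples.append(tuple(hexes[:3]))
    return triples

sample = """(Bus Numbers)  Primary(18)  Secondary(19)  Subordinate(1A)
------------------------------------------------------
    00           07             0A
"""
for pri, sec, sub in find_bridges(sample):
    if sec == "07":
        print(f"upstream bridge: primary={pri} secondary={sec} subordinate={sub}")
```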
Thanks again for the excellent advice. There is plenty for me to chew on there and more from the card vendor.
I did collect the SEL for the two card case, but in text only. The log is here. I don't see any indication of malformed TLP errors. It only seems to tell me what I already knew about resource exhaustion.
Hopefully your procedure will reveal more information. I will see about crafting an EFI script to get it done.
Here is a script I use when trying to collect the full PCI space dump. (I am not a programmer.)
echo This PCI space dump will take a bit.
echo Full dump completed time ~16hr --- Limited dump time ~2 hrs
copy log.txt old.txt
date > log.txt
time >> log.txt
pci >> log.txt
# Bus = v1 v2 -- 256 buses
for %a run (0 15)
  set v1 %a
  if %a == 10 then
    set v1 A
  endif
  if %a == 11 then
    set v1 B
  endif
  if %a == 12 then
    set v1 C
  endif
  if %a == 13 then
    set v1 D
  endif
  if %a == 14 then
    set v1 E
  endif
  if %a == 15 then
    set v1 F
  endif
  for %b run (0 15)
    set v2 %b
    if %b == 10 then
      set v2 A
    endif
    if %b == 11 then
      set v2 B
    endif
    if %b == 12 then
      set v2 C
    endif
    if %b == 13 then
      set v2 D
    endif
    if %b == 14 then
      set v2 E
    endif
    if %b == 15 then
      set v2 F
    endif
    # Device = v3 v4 -- 32 devices by spec; can be reduced to increase speed.
    # I know of no items with more than 10 devices.
    # Set loop variable %c to (0 1) for all possible devices,
    # %c to (0 0) for 16 devices,
    # or %c to (0 0) and %d to (0 10) for 10 devices.
    for %c run (0 0)
      set v3 %c
      if %c == 10 then
        set v3 A
      endif
      if %c == 11 then
        set v3 B
      endif
      if %c == 12 then
        set v3 C
      endif
      if %c == 13 then
        set v3 D
      endif
      if %c == 14 then
        set v3 E
      endif
      if %c == 15 then
        set v3 F
      endif
      for %d run (0 10)
        set v4 %d
        if %d == 10 then
          set v4 A
        endif
        if %d == 11 then
          set v4 B
        endif
        if %d == 12 then
          set v4 C
        endif
        if %d == 13 then
          set v4 D
        endif
        if %d == 14 then
          set v4 E
        endif
        if %d == 15 then
          set v4 F
        endif
        # Function = v5 v6 -- 8 functions max per specification
        for %e run (0 0)
          set v5 %e
          for %f run (0 7)
            set v6 %f
            echo Bus %v1%%v2% Device %v3%%v4% Function %v5%%v6%
            pci %v1%%v2% %v3%%v4% %v5%%v6% -i >> log.txt
          endfor
        endfor
      endfor
    endfor
  endfor
endfor
date >> log.txt
time >> log.txt
set -d v1
set -d v2
set -d v3
set -d v4
set -d v5
set -d v6
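Just to show what all the nibble juggling above is doing, the same bus/device/function sweep is a few lines in Python (illustration only; the actual collection still has to run in the EFI shell, where there is no hex formatting):

```python
# The EFI script spends most of its lines converting decimal loop counters
# to hex digits by hand. For comparison, the equivalent sweep in Python,
# producing the same "Bus Device Function" arguments fed to pci -i:

def sweep(buses=256, devices=11, functions=8):
    """Yield hex Bus Device Function argument strings for the pci command.
    Defaults mirror the script: all 256 buses, devices limited to 0x0A,
    all 8 functions."""
    for bus in range(buses):
        for dev in range(devices):
            for fn in range(functions):
                yield f"{bus:02X} {dev:02X} {fn:02X}"

args = list(sweep())
print(args[0], "...", args[-1], f"({len(args)} combinations)")
```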
Thanks very much for your script. The card vendor is going to update their firmware which I will retest. I'm sure your script will come in handy if the failures persist.
Their fix will reduce the per-card memory request from 512MB to 128MB over 5 windows. While a total of 512MB per card seems like a lot, the allocation failure is still surprising to me given the maximum MMIO setting you suggested earlier in the case. Shouldn't that have provided a huge I/O space?
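To sanity-check my own arithmetic (the free-window figure below is an assumption of mine, not a number from the board; my understanding is that 32-bit BARs must land below 4GB, so raising the high-MMIO limit would not grow the pool they draw from):

```python
# Rough arithmetic on the two firmware versions. The available low-MMIO
# figure is hypothetical; the SEL calls it resource exhaustion, so the
# real number is evidently smaller than what two cards demanded.
MB = 1 << 20
cards = 2
old_per_card = 512 * MB       # request before the vendor fix
new_per_card = 128 * MB       # request after the fix (over 5 windows)
assumed_window = 768 * MB     # assumed free 32-bit MMIO space below 4 GB
print("old total:", cards * old_per_card // MB, "MB,",
      "fits" if cards * old_per_card <= assumed_window else "does not fit")
print("new total:", cards * new_per_card // MB, "MB,",
      "fits" if cards * new_per_card <= assumed_window else "does not fit")
```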
I was reading the card spec today (which did not help much) but got to wondering about the embedded NIC.
The thought I had was that the dual NICs may be loading boot option ROMs. (There is not a lot of option ROM space, since it has to be loaded into hooks in lower memory.)
If this is the case, you can disable the PCI oproms in BIOS setup, but it is a little tricky.
With one card installed you need to:
- Boot and go into BIOS setup
- Select Advanced - PCI Configuration - PCI Oprom
- Disable the option rom.
- Power down,
- Install the next card
- Repeat from step 1 until all cards are installed and oproms disabled.
The oprom disable tab is kind of a catch-22.
The tab only shows up if BIOS detects and configures a card during POST that contains an option ROM; but if too much oprom space is used, BIOS can't configure the card, so it never shows up and hence you can't disable it either.
May not be the issue, but I figured it is worth a try.
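Rough numbers on why that oprom space runs out so fast (the region is the conventional legacy shadow area; the ROM sizes are typical values I assumed, not measured from your cards):

```python
# Legacy option ROMs get copied into the shadow region between 0xC0000
# and 0xE0000, and the video ROM usually claims the first chunk. The
# individual ROM sizes below are typical figures, assumed for illustration.
KB = 1024
region = 0xE0000 - 0xC0000      # 128 KB of legacy shadow space
video_rom = 64 * KB             # typical VGA BIOS claim (assumed)
nic_rom = 56 * KB               # typical PXE option ROM size (assumed)
free = region - video_rom
for n in (1, 2, 3, 4):          # NIC oproms trying to load
    verdict = "fits" if n * nic_rom <= free else "no room"
    print(n, "oprom(s):", n * nic_rom // KB, "KB needed,",
          free // KB, "KB free ->", verdict)
```

With numbers in that ballpark, one NIC oprom squeezes in but a second card's pair does not, which would match the cards vanishing only in the two-card configuration.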