We do sometimes see MEMRD errors with sccBmc -i, but if the training completes successfully, operation does not appear to be affected. We are looking into this issue.
Your other issue is more serious.
Can you (as root) reinstall the crbif module and see what happens? First remove it: rmmod crbif
Then before running install.csh again, remove and reinstall dkms.
apt-get remove --purge dkms
apt-get install dkms
Then go to /opt/sccKit/1.3.0/firmware and run install.csh. You should not see errors.
I attached a file showing the commands I ran this morning.
Can you then do a lspci -vvv and post your output? Thanks.
I checked on those MEMRD errors and yes, they are OK. They indicate that the DDR3 training failed. When this failure occurs, the training is restarted, so the error should be followed by a retraining and an error-free section. If the retraining fails, sccBmc will report that it failed. If sccBmc reports that it passed, you should be able to proceed.
Thanks for your help. I followed your instructions and reinstalled the driver, but unfortunately the problem persists. The last line displayed by sccBoot before the freeze is still "found object for MC x=0, y=0"; the next line according to your output ("WriteMemFromOBJ(...): Configuration of memory done") never appears on this machine.
Output of lspci -vvv is attached.
I also tried to compile the driver from the SVN repository, but it shows the same problem. After enabling full debug output (setting mcedev_debug to 0xFFFFFFFF), the following two lines are printed repeatedly (to dmesg running in another terminal):
crbif_write: MIP: 32 free, requested: 3168 bytes
crbif_write: No Packets to write
After rebooting and disabling the MCEDBG_READWRITE (0x1000) flag to shut off messages from crbif_write, only some messages from the driver's low-level DMA code are output to the debug log; please see the attached screenshots crbif.freeze.0.jpg and 1.jpg for details.
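The mask arithmetic behind that step can be sketched as follows. Only the two constants come from this thread; how the resulting value is handed to the driver (module parameter, recompile, or something else) is not shown here, so the sketch only computes the number.

```python
# Debug flag arithmetic for the crbif driver's mcedev_debug mask.
# Only these two values are taken from the thread; everything else is
# plain bit manipulation.
ALL_DEBUG = 0xFFFFFFFF        # "full debug output"
MCEDBG_READWRITE = 0x1000     # flag that enables the crbif_write messages


def clear_flag(mask: int, flag: int) -> int:
    """Return mask with every bit of flag switched off (32-bit result)."""
    return mask & ~flag & 0xFFFFFFFF


quiet_mask = clear_flag(ALL_DEBUG, MCEDBG_READWRITE)
print(hex(quiet_mask))  # 0xffffefff
```

Clearing a single flag this way leaves all other debug categories enabled, which is why the DMA messages still show up in the log.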
What caught my eye is that after the last two messages starting with "crbif_doDma" (timestamp 526.932531), the corresponding message from the ISR is missing (it usually follows almost immediately, about 3 microseconds after doDma). Afterwards, only messages from crbif_prepareDma are logged, which seems to indicate that the application is still filling the ring buffer but no data is being transferred to the FPGA anymore. Once the ring buffer is full (timestamp 527.322111), no further messages appear until around 46 seconds later, when the kernel logs "BUG: soft lockup".
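For anyone who wants to look for the same pattern in a saved dmesg log rather than in screenshots, here is a rough sketch. It assumes the usual "[ seconds.micros] tag: text" dmesg layout. The ISR tag name "crbif_isr", the 1 ms window, and the sample lines are placeholders invented for illustration, not taken from the driver, so adjust them to what your log actually prints.

```python
import re

# Standard dmesg prefix: "[  526.932531] tag: message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\w+):")


def unanswered_dma(lines, dma_tag="crbif_doDma", isr_tag="crbif_isr", window=0.001):
    """Return timestamps of dma_tag messages that are not followed by an
    isr_tag message within `window` seconds.

    isr_tag is a hypothetical name; replace it with the tag the driver's
    interrupt handler really uses. Lines are expected in time order.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))

    missing = []
    for index, (stamp, tag) in enumerate(events):
        if tag != dma_tag:
            continue
        answered = False
        for later_stamp, later_tag in events[index + 1:]:
            if later_stamp - stamp > window:
                break
            if later_tag == isr_tag:
                answered = True
                break
        if not answered:
            missing.append(stamp)
    return missing


# Made-up sample lines in the assumed format.
sample = [
    "[  526.932000] crbif_doDma: start",
    "[  526.932003] crbif_isr: done",
    "[  526.932531] crbif_doDma: start",
    "[  526.940000] crbif_prepareDma: queued",
]
print(unanswered_dma(sample))
```

The first timestamp returned would be the point where the FPGA stopped answering, which makes it easier to compare runs.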
Do you think I should continue trying to get it running on this PC, or would it be more worthwhile to start from scratch on another mainboard? If so, is there any particular model (or brand) you can recommend?
I just tried running sccBoot again and got a shorter message log before the machine froze (please see attached image for details). This time, there are only 5 calls to crbif_prepareDma between the last doDma and soft lockup message.
It seems the number of DMA operations it can complete is quite random.
You may have already done this, but did you look at Michael's post concerning the Par Lab issue? There he suggested doing an
apt-get remove crbif-dkms
and then reinstalling dkms.
I've escalated this bug to a blocker in our Bugzilla database so that it can get more internal attention. It's bug 61. Do you have a Bugzilla account with us? If not, please create one and add yourself to the CC list on the bug.
Yes, I read the post and removed the dkms package completely, as per the instructions you attached to your first reply; removing dkms also removes crbif-dkms. Interestingly, I needed to do so to get /opt/sccKit/1.3.0/firmware/install.csh running again; if the same driver version is already installed, re-running dkms through the install.csh script fails because the symbolic link to the driver's source code is missing.
Thanks for filing the bug. I don't have a bugzilla account yet, but have just signed up for one. Now waiting for the confirmation mail.
Can you recommend any specific mainboard for building an MCPC? Besides the freezing issue, the current one has a rather annoying BIOS bug as it won't boot at all if the SCC is powered on. I need to halt the boot process at the GRUB2 menu, then power on the SCC via a secondary PC before continuing to start the MCPC.
For motherboard recommendations, please look at your welcome letter.
Intel BOXDX58SO LGA 1366 Intel X58 ATX Intel Motherboard
Intel Server Systems SR1630GP, SR1630HGP, SR1630GPRX, and SR1630HGPRX
Just wanted to check (I suspect you have done the following) but
Can you telnet into the BMC? .... tn <BMC IP address> 5010
Can you bring up the sccGui? ... sccGui &
Can you train with the Gui?
What happens if you try to boot Linux with the Gui?
Did you set your BMC IP address as described in http://communities.intel.com/docs/DOC-5313?
Thank you, we'll order one of those boards as soon as possible.
The MCPC currently is a Dell Optiplex 755, with the PCIe card in the PEG slot (which unfortunately is the only PCIe slot on the board). I assume this is the reason why the PC doesn't boot with the SCC powered on; the BIOS seems not to be very happy with anything but a graphics card in that slot. I also tried an older P4 from FSC, but it didn't recognize the SCC at all.
MCPC is running Ubuntu 10.04.1 64-bit, as I wasn't able to find 10.04 when I installed the system 4 days ago. This doesn't seem to affect the driver installation, though; the install.csh script of both the 1.2.3 and 1.3.0 sccKit ran without error, and both resulted in a working crbif ("working" as in: driver loaded on system start, and being able to perform an "sccBmc -i" without error).
The system has an additional ethernet card for the BMC network, and the BMC uses its default IP address of 192.168.2.127. Telnet to the BMC works well.
sccGui can be brought up, but I haven't tried training or booting Linux this way; I always used the command line utilities.
"sccBmc -i" works without errors (the MEMRD errors show up every 2nd to 3rd try, but they are always gone after the automatic retraining).
"sccBoot -l" always hangs while transferring the memory image, regardless of whether it is invoked for all cores or for a single specific core.
I will try training and booting Linux via the GUI once I am at the university again. I'll attend ERIC at Braunschweig, so this won't be before Thursday.
Attending ERIC may solve all your issues. You'll get to meet the sccKit authors there.
Using an updated version of Ubuntu 10.04 is fine. We update our local MCPC systems.
I haven't seen anyone use the default BMC address, but I thought about this a while, and I cannot come up with a good reason not to. If anyone else in the community has one, please post it here.
The key, of course, is that the BMC IP address be static, and that /etc/network/interfaces be set up correctly. Look at the file How to Set Up Your MCPC for a sample interfaces file. This file is very finicky... no spaces are allowed between the last character on a line and the line terminator. Check the file both before and after booting the MCPC.
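Since trailing blanks are hard to spot by eye, a small check like the following may help. It is only a sketch that looks for spaces or tabs at the end of a line; it does not validate the interfaces syntax itself. The stanza in the example, including the address, is made up for illustration.

```python
def trailing_whitespace_lines(text):
    """Return 1-based numbers of lines that end in a space or tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip(" \t"):
            bad.append(number)
    return bad


def check_interfaces(path="/etc/network/interfaces"):
    """Run the trailing-whitespace check on a file on disk."""
    with open(path, encoding="utf-8") as handle:
        return trailing_whitespace_lines(handle.read())


# Made-up stanza; line 2 has a trailing space after "static".
example = "auto eth1\niface eth1 inet static \n    address 192.168.2.1\n"
print(trailing_whitespace_lines(example))  # [2]
```

Running check_interfaces() both before and after a reboot would show whether something rewrites the file in between.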
I did notice some differences between the output of my lspci -vvv and yours, and I posted those in the attached file.
Here at the University of Amsterdam we experience exactly the same problem. We can see the board (lspci), telnet, ssh, and initialize it (sccBmc -i), but when we try to boot Linux (with either sccGui or sccBoot), or run the memtest from the sccGui, it freezes.
Our MCPC is a HP ProLiant ML110 G6.
I attached the output of lspci -vvv, the output of sccBmc -i and sccBoot -l 0 (just after reinstallation of crbif-dkms, as described in your post and installing_crbif.txt).
dmesg is full of these messages:
[ 1487.920783] crbif_doDma: Error in MIP/MOP FIFOs: 0x20000000
[ 1487.920787] crbif_daemon: Could not trigger DMA transfer
[ 1488.919103] crbif_doDma: Error in MIP/MOP FIFOs: 0x20000000
[ 1488.919107] crbif_daemon: Could not trigger DMA transfer
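To get an overview of how often each FIFO status word occurs, a saved dmesg dump can be summarized with a short script. This is a sketch that relies only on the message text quoted above; it lists which bit positions are set in each status word but makes no claim about what those bits mean in the driver.

```python
import re
from collections import Counter

FIFO_RE = re.compile(r"crbif_doDma: Error in MIP/MOP FIFOs: (0x[0-9a-fA-F]+)")


def fifo_error_summary(lines):
    """Count each distinct FIFO status word and list its set bit positions."""
    counts = Counter()
    for line in lines:
        match = FIFO_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return {
        hex(value): {
            "count": count,
            "bits": [bit for bit in range(32) if (value >> bit) & 1],
        }
        for value, count in counts.items()
    }


log = [
    "[ 1487.920783] crbif_doDma: Error in MIP/MOP FIFOs: 0x20000000",
    "[ 1487.920787] crbif_daemon: Could not trigger DMA transfer",
    "[ 1488.919103] crbif_doDma: Error in MIP/MOP FIFOs: 0x20000000",
    "[ 1488.919107] crbif_daemon: Could not trigger DMA transfer",
]
print(fifo_error_summary(log))
# {'0x20000000': {'count': 2, 'bits': [29]}}
```

If different status words show up on different machines, that would be useful detail to add to the bug report.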
My previous post referred to an attached file that was missing. I attached it to this post.
This issue is very baffling. Note that it is escalated as a blocking bug in our Bugzilla. It is bug 61.
At this point, I don't think it is configuration or user error. I think the problem may be hardware. This may not mean that some hardware is broken; there may be a compatibility issue.
We have distributed an sccProductionTest with sccKit. This test was run on all systems before they were sent out. Can you try it on your system and post your results? I attached a file that shows the output when I ran it locally.
This command is in /opt/sccKit/current/bin. It must be run as root (so root must have the path to the sccKit binaries). You also should be in the directory /opt/sccKit/current when you run it.
As expected, this one also fails and hangs at the same point. I added myself to the CC list for bug 61.
INFO: Welcome to sccReset 1.3.0 (build date Aug 25 2010 - 15:55:54)...
INFO: Applying global software reset to SCC (cores & CRB registers)...
INFO: Welcome to sccBoot 1.3.0 (build date Aug 25 2010 - 15:55:06)...
INFO: Starting to boot Linux: All cores!
INFO: Using linux image "/opt/sccKit/1.3.0/resources/linux.obj" (default image as defined by sccGui "Settings->Linux boot settings")...
INFO: Preloading Memory with pre-merged linux object file "/opt/sccKit/1.3.0/resources/premerge_image_0_0.32.obj"...
This discussion is now being carried on in Bugzilla, bug 61. If you are interested in this issue, please go to our Bugzilla, create an account if you don't already have one, and add yourself to the CC list for this bug.
Sorry for replying in this forum instead of Bugzilla, but my account has not been activated yet...
I spoke to the SCC developers at ERIC, and the theory is that there are some transmission errors on the PCIe bus, which ultimately cause the FPGA to enter a wrong state and become unresponsive. There is no known fix or workaround but to try another mainboard. We have already ordered one of the suggested ones (Intel BOXDX58SO) and expect it to be delivered today or tomorrow. I'll report back once there are any new results.
In the meantime, I'm experimenting with the old setup that experiences the freezing problems. Interestingly, I have been able to get Linux up and running on 24 cores. When trying it on all 48 cores, I get the same errors as Roy; however, when using 24 or fewer (sorry, I haven't tested anything in between yet), it just works. In text mode, the message "NOHZ: local_softirq_pending 08" is occasionally displayed on the console, but that doesn't seem to be a problem.
In short, all I did was to disable DMA transfers in the crbif module (e.g., "insmod crbif.ko usePioOnlyRead=1 usePioOnlyWrite=1"). This forces the driver to use PIO all the time, which seems to fix (at least some of) the transmission errors. I can successfully ping active cores afterwards.
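To double-check that the module really was loaded with those options, one possible approach is to read the values back from sysfs. This sketch assumes the driver exports usePioOnlyRead and usePioOnlyWrite under /sys/module/crbif/parameters, which depends on how the parameters are declared in the driver source and is not confirmed in this thread. A result of None simply means the file could not be read.

```python
from pathlib import Path


def read_module_params(module, names, sysfs_root="/sys/module"):
    """Return {name: value or None} for the given module parameters.

    None means the file is absent or unreadable: the module is not
    loaded, or the parameter is not exported through sysfs.
    """
    base = Path(sysfs_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (base / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


print(read_module_params("crbif", ["usePioOnlyRead", "usePioOnlyWrite"]))
```

The sysfs_root argument exists only so the function can be pointed at a test directory; on a real MCPC the default should be left alone.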
I'm running a memory test via sccGui now, which also seems to work flawlessly (currently at 30%). When trying to do so with DMA active, it would freeze almost immediately. How long is the test expected to take?
By the way: as I can successfully ping booted Linux cores, as well as reach the BMC via telnet from the MCPC, I assume my /etc/network/interfaces file is fine.