We are again having some trouble with our SCC (RockyLake) setup. I can successfully start SCC Linux on all 48 cores, but the PCIe link becomes unresponsive if too many cores are performing too much I/O to the /shared file system at the same time.
Once in this state, sccGui reports timeouts during the DMA transfers. Sometimes, it logs a single line that an unexpected packet had been received (with varying command codes), but this message does not appear all the time. The crbif driver logs error messages to the text-mode console: MIP/MOP status 0x20000000, unable to start DMA transfer.
The SCC performance meter no longer displays the current CPU load; the graphs are advancing, but only plot the same (last-known) value. Also, network traffic to the SCC is no longer possible, and "ping rck**" times out.
Occasionally, after some time, I have seen the screen of the MCPC go black (as if the X server had been shut down), after which the system locks up completely. If this is the case, I cannot even reboot the MCPC without a hard power cycle on the SCC first; it displays the BIOS screen, but Linux never comes up. It just sits at a black screen. If the MCPC is still responsive, I can call sccPowercycle, but it usually (roughly 9 out of 10 tries) does not fix the problem; even re-training the SIF fails until the SCC has been hard-powercycled.
To generate I/O load, we use a recompiled SCC Linux and some scripts. The Linux code is identical to the one in the public repository, with one change: in rckos/fs/etc/init.d/tc-config, I appended the following:
# HPI: Additional init script
if [ -f /shared/coreInit/init.sh ]
then
    echo "%%%% Invoking /shared/coreInit/init.sh for PID$core" > /shared/coreInit/results/PID$core.log
    chmod 777 /shared/coreInit/results/PID$core.log
    /shared/coreInit/init.sh $core < /dev/null >> /shared/coreInit/results/PID$core.log 2>&1
    echo "%%%% Done for PID$core" >> /shared/coreInit/results/PID$core.log
fi
This executes a script from the shared file system after each core has been booted. To reproduce my problem, it is now sufficient to insert some copy commands (for roughly 20MB of data) from /shared to the SCC Linux ramdisk, then start all 48 cores in parallel. For up to ~30 cores, this works fine and the files are copied correctly; once too many cores are started simultaneously (and thus produce I/O at the same time), the PCIe connection breaks.
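For anyone who wants to try this, a minimal sketch of what such an init.sh could look like is given here. It is not our exact script: the payload directory, the target directory, and the function name copy_payload are assumptions chosen for illustration, and the byte-for-byte comparison is only there to confirm that the files arrived intact.

```shell
#!/bin/sh
# Hypothetical /shared/coreInit/init.sh (illustrative sketch, not the original).
# Copies a test payload from /shared to local storage to generate I/O load.

# Assumed locations; adjust to wherever the ~20MB of test data actually lives.
SRC_DIR="${SRC_DIR:-/shared/coreInit/payload}"
DST_DIR="${DST_DIR:-/tmp/payload}"

copy_payload() {
    # $1 = source directory, $2 = destination directory, $3 = core ID
    src="$1"
    dst="$2/core$3"
    mkdir -p "$dst" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        cp "$f" "$dst/" || return 1
        # Compare source and copy to detect a corrupted transfer.
        cmp -s "$f" "$dst/$(basename "$f")" || { echo "MISMATCH: $f"; return 1; }
    done
    echo "core $3: copy OK"
}

# Only run when the payload directory exists; the core ID is the first
# argument, as passed by the tc-config snippet.
if [ -d "$SRC_DIR" ]; then
    copy_payload "$SRC_DIR" "$DST_DIR" "${1:-0}"
fi
```

The output of the script ends up in /shared/coreInit/results/PID$core.log through the redirection in tc-config, so a missing "copy OK" line for a core shows where the copy stopped.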
The MCPC uses the recommended Intel DX58SO mainboard.
Has anybody experienced similar issues? Any help is appreciated.
Although we don't have physical access to a complete SCC/MCPC system, we have similar problems with our data center system.
We ran a simple but extended RCCE ping-pong application in which 24 core pairs communicate. Each process does some plain fprintf file I/O after a number of measurement steps have completed. We don't think that this causes a very high I/O load. Nevertheless, our MARC system crashes reproducibly when running our application and needs a hard reset by the MARC administrators. Due to the different time zones between the MARC DataCenter and our location (Germany), there is a delay of about one day before we can get back to work on the SCC. A remote reset button would be a more sophisticated solution for that problem ;-)
In any case, there should be an Intel-provided solution or at least a temporary workaround for that problem, since running applications without stable I/O isn't much fun.
I have run your code and yes, I can crash the SCC. Heavy duty I/O has had a reputation for crashing SCC systems, and this example clearly does a lot of printf's. As you pointed out, I had to write a script that called your program a number of times to see this effect.
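A wrapper of the kind mentioned above can be as simple as the following sketch. The function name repeat_run is made up for illustration, and the command to run is left as a parameter, since the exact launch line depends on the local setup.

```shell
#!/bin/sh
# Run a command repeatedly until it fails or the run limit is reached.
# Usage: repeat_run <max_runs> <command> [args...]
repeat_run() {
    max="$1"
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "run $i of $max"
        # Stop at the first failing run so the run number is visible in the log.
        "$@" || { echo "command failed on run $i"; return 1; }
        i=$((i + 1))
    done
    echo "all $max runs completed"
}
```

For example, `repeat_run 50 ./run_pingpong.sh` would call a (hypothetical) launch script up to 50 times and report the run on which it first failed. If the SCC hangs outright, the last "run N of M" line printed shows how far it got.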
All of the code on the SCC is open source. If you see anything in any of the drivers that might cause this issue, please let us know. I filed a bug about this issue (Bug 86). Please move the discussion to our Bugzilla at http://marcbug.scc-dc.com/bugzilla3/. If there is no solution forthcoming, we will add it to our list of errata.
I have found out another interesting thing about this problem. We first thought the problem was that all cores write too much at the same time. But I have modified my program so that only one core does all the printf's; the other cores only send their values to core one, which prints them on their behalf. That version seems to hold up longer, but after a while it crashes anyway. So it seems that it doesn't matter how many cores do printf's. A single core can crash the SCC as well.