We are again having some trouble with our SCC (RockyLake) setup. I can successfully start SCC Linux on all 48 cores, but the PCIe link becomes unresponsive if too many cores perform too much I/O on the /shared file system at the same time.
Once in this state, sccGui reports timeouts during the DMA transfers. Sometimes, it logs a single line that an unexpected packet had been received (with varying command codes), but this message does not appear all the time. The crbif driver logs error messages to the text-mode console: MIP/MOP status 0x20000000, unable to start DMA transfer.
The SCC performance meter no longer displays the current CPU load; the graphs are advancing, but only plot the same (last-known) value. Also, network traffic to the SCC is no longer possible, and "ping rck**" times out.
Occasionally, after some time, I have seen the screen of the MCPC go black (as if the X server had been shut down), after which the system locks up completely. If this is the case, I cannot even reboot the MCPC without a hard power cycle on the SCC first; it displays the BIOS screen, but Linux never comes up. It just sits at a black screen. If the MCPC is still responsive, I can call sccPowercycle, but it usually (roughly 9 out of 10 tries) does not fix the problem; even re-training the SIF fails until the SCC has been hard power-cycled.
To generate I/O load, we use a recompiled SCC Linux and some scripts. The Linux code is identical to the one in the public repository, with one change: in rckos/fs/etc/init.d/tc-config, I appended the following:
# HPI: Additional init script
# Run the per-core script from /shared, if present, and log its output per core.
if [ -f /shared/coreInit/init.sh ]
then
    echo "%%%% Invoking /shared/coreInit/init.sh for PID$core" > /shared/coreInit/results/PID$core.log
    chmod 777 /shared/coreInit/results/PID$core.log
    /shared/coreInit/init.sh $core < /dev/null >> /shared/coreInit/results/PID$core.log 2>&1
    echo "%%%% Done for PID$core" >> /shared/coreInit/results/PID$core.log
fi
This executes a script from the shared file system after each core has booted. To reproduce my problem, it is now sufficient to insert some copy commands (for roughly 20MB of data) from /shared to the SCC Linux ramdisk, then start all 48 cores in parallel. For up to ~30 cores, this works fine and the files are copied correctly; once too many cores are started simultaneously (and thus produce I/O at the same time), the PCIe connection breaks.
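For anyone who wants to try this, here is a minimal sketch of what such an init.sh could look like. It is not our exact script: the payload file name, the target path in the ramdisk, and the copy_payload helper are placeholders I made up for illustration, and I am assuming cp and cmp are available in the SCC Linux image.

```shell
#!/bin/sh
# Hypothetical /shared/coreInit/init.sh (illustration only, not the original script).
# $1 is the core ID passed in by tc-config.
core="${1:-unknown}"

# Assumed paths: a ~20MB test file on /shared and a target inside the ramdisk.
SRC="${SRC:-/shared/coreInit/payload.bin}"
DST="${DST:-/tmp/payload-$core.bin}"

copy_payload() {
    # Copy one file, then verify the copy matches the source byte for byte.
    cp "$1" "$2" && cmp -s "$1" "$2"
}

if [ -f "$SRC" ]; then
    if copy_payload "$SRC" "$DST"; then
        echo "core $core: copy OK"
    else
        echo "core $core: copy FAILED"
    fi
else
    echo "core $core: $SRC not found, nothing copied"
fi
```

The output ends up in the per-core log written by the tc-config snippet, so after a run it is easy to see which cores finished their copy before the link went down.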
The MCPC uses the recommended Intel DX58SO mainboard.
Has anybody experienced similar issues? Any help is appreciated.