When I try to run the RCCE application, sometimes it runs successfully, but sometimes it hangs and never responds.
The link for pssh has been created in the /usr/bin directory.
This is different from bug #16 because I am using rccerun.
RCCE is configured with the SCC option. The stencil program was compiled with make stencil_synch.
I have attached my stencil_synch run logs for a successful run and two hung runs.
The hang occurs randomly and can affect every program. This error prevents me from completing a stress test.
The stress test hangs at a different program on each run.
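Because the hang is intermittent, it can help to wrap each launch in a timeout so that a hung run is recorded instead of blocking the whole stress test. The following is only a minimal sketch, not part of RCCE: the function name `run_with_timeout` and the stand-in command are hypothetical, and the real launch line (for example the rccerun or pssh command from the logs) would have to be substituted. Note that killing the local launcher does not necessarily clean up processes already started on the cores.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the result.

    Returns ("ok", 0), ("failed", returncode), or ("hang", None) when the
    command does not exit within timeout_s seconds.
    """
    try:
        # stdin is redirected from the null device, like "< /dev/null"
        proc = subprocess.run(cmd, stdin=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return ("hang", None)
    if proc.returncode == 0:
        return ("ok", 0)
    return ("failed", proc.returncode)

if __name__ == "__main__":
    # Stand-in command (assumption); replace with the real launch line.
    cmd = [sys.executable, "-c", "print('run finished')"]
    for attempt in range(3):
        print(attempt, run_with_timeout(cmd, timeout_s=10))
```

Repeating the same command many times with such a wrapper gives a rough hang rate, which makes it easier to tell whether a change (new RCCE revision, reboot) actually helped.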
I noticed from your log file that you are using RCCE 1.0.6. This is a very old version of RCCE. I would recommend using the trunk. The last release was 1.0.13. We'll have another release very shortly, 1.1. The only thing holding it up is that the emulator is broken in the trunk. It's not broken in a mysterious way. The fix is known, just not yet implemented.
I did run the stress test several times on one of our data center systems. And I ran stencil_synch also several times. I did not see any hangs. I attached some log files that show what I did.
I did notice that you have RCCE checked out under /shared. I don't see anything wrong with that, but it's not typical operation. Usually, people check out RCCE under their /home. They then copy the executable to their directory under /shared. One advantage of this method in the data center is that the /home directories are backed up; the /shared directories are not. However, you are running on your own MCPC/RockyLake. Again, I see nothing wrong with putting your stuff under /shared, but that and the older RCCE are the only differences I can see.
What level of stress test were you running? I tried -S (the small one) just because it's faster.
With RCCE 1.0.6, I tried the stress test with -S (small input). I think the hang is not related to the input size.
I have downloaded the newest RCCE from the trunk (revision 153).
I used ./configure SCC_LINUX and makeall.
In the existing apps, I made pingpong, stencil and stencil_synch.
The problem is that when I try to run pingpong, it always hangs at the same line.
When I tried stencil, it failed with error code 139.
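If the 139 reported here is a shell-style exit status (an assumption; the log does not say how the code was obtained), it follows the usual convention of 128 plus a signal number, so 139 = 128 + 11, which is SIGSEGV (a segmentation fault) on Linux. A small sketch of that decoding, with the hypothetical helper name `describe_exit_status`:

```python
import signal

def describe_exit_status(status):
    """Interpret a shell-style exit status in the range 0-255."""
    if status == 0:
        return "success"
    if status > 128:
        # Shells report "terminated by signal N" as 128 + N.
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            return "exit status %d (no known signal %d)" % (status, signum)
    return "exited with error %d" % status

print(describe_exit_status(139))  # on Linux: killed by SIGSEGV
```

Under that reading, the stencil failure is a crash inside the program rather than a hang, so it may be a separate problem from the intermittent hangs.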
Strangely, I can run stencil_synch successfully, but not always. Most of the time, it hangs after the line
pssh -h PSSH_HOST_FILE.30606 -t -1 -P -p 2 /shared/rcce/apps/STENCIL/stencil_synch 2 0.533 00 01 < /dev/null
I attached my run log for each application.
MyRCCEtests_log.txt.zip 1.6 K
I downloaded RCCE from the trunk and recompiled it. Then I rebooted my BMC.
Now I can run pingpong successfully. It just takes a long time to finish.
stencil_synch has not hung since I rebooted my BMC.
I can only run stencil with one core.
I have attached my successful run logs.
I can successfully run the stress test with -S.
Your suggestion is very useful. Thank you.
MyRCCEtests_log.txt.zip 3.4 K
Rebooting solves a lot of problems. Sometimes the system just gets into a bad state and we take it down to the ground and bring it back up.
I did put rcce under /shared on one of our Intel systems and was unable to see a hang. I looked at your log file and issued the same commands as you did. I made some log files with the script command and attached them.
An earlier version of pingpong did take a long time, but I think the newer version does less and hence is faster. In any case one of my log files shows the timing for pingpong on an Intel system.
I don't see you doing anything different from what you did when it hung. Was it just the rebooting that helped?
I downloaded a new RCCE source, rebuilt it, and rebooted the system.
Before rebooting the system, it gave me the same errors.
I think rebooting the system really helps it recover from a bad state.
After the reboot, I can run RCCE programs successfully.
By the way, is there any way to turn the BMC off and on again without rebooting the MCPC?
Normally, we have to remove crbif, but I don't know how to reload it without rebooting the MCPC.
What do you mean by turning off power to the BMC? There are two power sources for the SCC unit. One goes to the chip and the other is main power to the board.
You can turn off the board by either logging into the BMC (telnetting) and issuing a power off command or by turning off the switch behind the dropdown front panel of the SCC unit. Sometimes I have seen the MCPC hang if I telnet to the BMC and issue a power off without removing (rmmod) crbif. This doesn't always occur, and hardly ever occurs in the data center. But I've seen it happen with standalone MCPC/SCC systems. The BMC is still running.
You can turn off main power to the board by switching off the switch at the back of the SCC unit. This would turn off the BMC.
If you've removed crbif, you have to reboot the MCPC to get it back. If you have turned off the SCC unit without removing crbif, I think you can just turn it back on without rebooting the MCPC and be OK. I think this is true even if you turn off both the chip and the board.
What I haven't tried is turning off the board while leaving the chip powered on. We have some apocrypha that this is a bad thing to do ... that it might actually cause some damage, but no evidence to support this.
Some of our remote users have access to a web power switch. With that switch, they can turn off power to the MCPC and the SCC unit separately. This is a hard power down ... like pulling out the power cord on each of the systems (MCPC and SCC). When they turn off power to the SCC (that's what I referred to above as main power to the board), we recommend that they power off the chip first (the BMC power-off command).
As an additional point (not power related), I think that you could actually unplug the eth1 cable and the system would still be operational ... in the sense of training, booting, and running core programs. Without eth1 connected, you cannot telnet to the BMC though.