7 Replies Latest reply on Feb 8, 2011 3:14 PM by tedk

    Does your RCCE program hang?

    tedk

      Please check out bug 16 http://marcbug.scc-dc.com/bugzilla3

       

      It is important to clear the message passing buffer before running an RCCE program. The program rccerun will do this for you, so the preferred way to run an RCCE program is to use rccerun.

       

      Note that rccerun uses pssh. When you install pssh on an Ubuntu system, it is called parallel-ssh in /usr/bin. For rccerun to work, you must make a link called pssh that points to parallel-ssh. Do this as root.

      cd /usr/bin

      ln -s parallel-ssh pssh
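The two commands above fail if the link already exists and create a dangling link if parallel-ssh is not installed. A minimal sketch of a guarded version follows; `ensure_pssh_link` is a hypothetical helper name, not part of RCCE or pssh, and the optional directory argument is an assumption added so the function can be tried outside /usr/bin.

```shell
#!/bin/sh
# Sketch: create the pssh -> parallel-ssh link only when it is needed.
# ensure_pssh_link is a hypothetical helper, not part of RCCE or pssh.
# The directory argument defaults to /usr/bin, as in the post above.
ensure_pssh_link() {
    bindir="${1:-/usr/bin}"

    # Nothing to do if a file or link named pssh is already there.
    if [ -e "$bindir/pssh" ] || [ -L "$bindir/pssh" ]; then
        echo "pssh already present in $bindir"
        return 0
    fi

    # Refuse to create a dangling link.
    if [ ! -e "$bindir/parallel-ssh" ]; then
        echo "parallel-ssh not found in $bindir" >&2
        return 1
    fi

    # Same effect as: cd "$bindir" && ln -s parallel-ssh pssh
    ( cd "$bindir" && ln -s parallel-ssh pssh ) && echo "created $bindir/pssh"
}
```

Run as root, calling `ensure_pssh_link` with no argument acts on /usr/bin; calling it a second time leaves the existing link alone.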

        • 1. Re: Does your RCCE program hang?
          pollawat

          Hi,

           

          When I try to run the RCCE application, sometimes it runs successfully, but sometimes it hangs and never responds.

          The link for pssh has been created in the /usr/bin directory.

          This is different from bug 16 because I am using rccerun.

          It is configured with SCC option. The stencil program has been compiled by make stencil_synch.

          I have attached my stencil_synch run logs for a successful run and two hung runs.

          The hang occurs randomly and affects every program, which prevents me from completing a stress test.

          The stress test hangs at a different program each time.
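An intermittent hang like the one described above is easier to report when it is counted. The sketch below repeats a command under a time limit and counts how many runs had to be cut off. It assumes GNU coreutils `timeout` is available on the machine that launches the runs; `count_hangs` is a hypothetical helper, and the command to run (for example an rccerun invocation) is passed in by the caller rather than assumed here.

```shell
#!/bin/sh
# Sketch: count_hangs RUNS LIMIT_SECONDS CMD [ARGS...]
# Runs CMD the given number of times and prints how many runs were still
# going when the time limit expired. count_hangs is a hypothetical helper.
count_hangs() {
    runs="$1"
    limit="$2"
    shift 2

    hangs=0
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        # GNU timeout exits with status 124 when it had to stop the command.
        timeout "$limit" "$@" < /dev/null > /dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
            echo "run $i: no exit within ${limit}s" >&2
        else
            echo "run $i: exit status $status" >&2
        fi
        i=$((i + 1))
    done

    # Only the count goes to stdout so it can be captured by the caller.
    echo "$hangs"
}
```

Note that `timeout` only signals the process it started. Whether processes already launched on the cores are cleaned up afterwards is not something this sketch checks.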

          • 2. Re: Does your RCCE program hang?
            tedk

            I noticed from your log file that you are using RCCE 1.0.6. This is a very old version of RCCE. I would recommend using the trunk. The last release was 1.0.13. We'll have another release very shortly, 1.1. The only thing holding it up is that the emulator is broken in the trunk. It's not broken in a mysterious way. The fix is known, just not yet implemented.

             

            I did run the stress test several times on one of our data center systems. And I ran stencil_synch also several times. I did not see any hangs. I attached some log files that show what I did.

             

            I did notice that you have RCCE checked out under /shared.  I don't see anything wrong with that, but it's not typical operation. Usually, people check out RCCE under their /home. They then copy the executable to their directory under /shared. One advantage of this method in the data center is that the /home directories are backed up; the /shared directories are not. However, you are running on your own MCPC/RockyLake. Again, I see nothing wrong with putting your stuff under /shared, but that and the older RCCE are the only differences I can see.

             

            What level of stress test were you running? I tried -S (the small one) just because it's faster.

            • 3. Re: Does your RCCE program hang?
              pollawat

              With RCCE 1.0.6, I tried the stress test with the -S (small) input. I don't think the hang is related to the input size.

               

              I have downloaded the newest RCCE from the trunk (revision 153).

              I used ./configure SCC_LINUX and makeall.

              In the existing apps, I made pingpong, stencil and stencil_synch.

              The problem is that when I try to run pingpong, it always hangs at the same line.

              When I tried stencil, it failed with error code 139.

              Strangely, stencil_synch sometimes runs successfully, but most of the time it hangs after the line

              pssh -h PSSH_HOST_FILE.30606 -t -1 -P -p 2 /shared/rcce/apps/STENCIL/stencil_synch 2 0.533 00 01 < /dev/null

              I attached my run log for each application.
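If the error code 139 reported above is a shell-style exit status (an assumption, since the post does not say where the number was printed), it can be decoded: statuses above 128 conventionally mean the process was terminated by signal number (status - 128), and 139 - 128 = 11, which is SIGSEGV on Linux. A minimal sketch, with `describe_exit_status` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: turn a shell exit status into a short description.
# describe_exit_status is a hypothetical helper, not part of RCCE or pssh.
describe_exit_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ]; then
        # By shell convention, 128 + N means "terminated by signal N".
        sig=$((status - 128))
        # kill -l N prints the name of signal N; fall back if N is not valid.
        name=$(kill -l "$sig" 2>/dev/null || echo unknown)
        echo "terminated by signal $sig ($name)"
    else
        echo "exited with status $status"
    fi
}
```

Calling `describe_exit_status 139` reports signal 11, which would point to a segmentation fault in stencil rather than a hang.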

              • 4. Re: Does your RCCE program hang?
                pollawat

                I downloaded RCCE from the trunk and recompiled it. Then I rebooted my BMC.

                Now, I can run pingpong successfully. It just takes a long time to finish the run.

                stencil_synch has not hung since I rebooted my BMC.

                I can only run stencil with one core.

                I have attached my successful run logs.

                I can successfully run the stress test with -S.

                Your suggestion is very useful. Thank you.

                • 5. Re: Does your RCCE program hang?
                  tedk

                  Rebooting solves a lot of problems. Sometimes the system just gets into a bad state and we take it down to the ground and bring it back up.

                   

                  I did put rcce under /shared on one of our Intel systems and was unable to see a hang. I looked at your log file and issued the same commands as you did. I made some log files with the script command and attached them.

                   

                  An earlier version of pingpong did take a long time, but I think the newer version does less and hence is faster. In any case one of my log files shows the timing for pingpong on an Intel system.

                   

                  I don't see you doing anything different from what you did when it hung. Was it just the rebooting that helped?

                  • 6. Re: Does your RCCE program hang?
                    pollawat

                    I downloaded the new RCCE source, rebuilt it, and rebooted the system.

                    Before rebooting the system, it gave me the same errors.

                    I think rebooting the system really helps it recover from a bad state.

                    After the reboot, I can run rcce programs successfully.

                    By the way, is there any way to turn off the BMC and turn it on again without rebooting the MCPC?

                    Normally, we have to remove crbif but I don't know how to reload it without rebooting the MCPC.

                    • 7. Re: Does your RCCE program hang?
                      tedk

                      What do you mean by turning off power to the BMC? There are two power sources for the SCC unit. One goes to the chip and the other is main power to the board.

                       

                      You can turn off the board by either logging into the BMC (telnetting) and issuing a power off command or by turning off the switch behind the dropdown front panel of the SCC unit. Sometimes I have seen the MCPC hang if I telnet to the BMC and issue a power off without removing (rmmod) crbif. This doesn't always occur, and hardly ever occurs in the data center. But I've seen it happen with standalone MCPC/SCC systems. The BMC is still running.

                       

                      You can turn off main power to the board by switching off the switch at the back of the SCC unit. This would turn off the BMC.

                       

                      If you've removed crbif, you have to reboot the MCPC to get it back. If you have turned off the SCC unit without removing crbif, I think you can just turn it back on without rebooting the MCPC and be OK. I think this is true even if you turn off both the chip and the board.

                       

                      What I haven't tried is turning off the board while leaving the chip powered on. We have some apocrypha that this is a bad thing to do ... that it might actually cause some damage, but no evidence to support this.

                       

                      Some of our remote users have access to a web power switch. With that switch, they can turn off power to the MCPC and the SCC unit separately. This is a hard power down ... like pulling out the power cord on each of the systems (MCPC and SCC). When they turn off power to the SCC (that's what I referred to above as main power to the board), we recommend that they power off the chip first (the BMC power-off command).

                       

                      As an additional point (not power related), I think that you could actually unplug the eth1 cable and the system would still be operational ... in the sense of training, booting, and running core programs. Without eth1 connected, you cannot telnet to the BMC though.