16 Replies Latest reply on May 30, 2011 10:38 AM by saibbot

    RCCE Communication gets stuck

    saibbot

      For a long time, my colleagues and I, working on the marc026 SCC, have had the following problems:

       

      • The barriers (RCCE_barrier) are not always passed.
      • The communication may get stuck (both with RCCE and iRCCE libraries).

       

      These problems appear when trying to run a program on more than 8 cores, although there were times when runs on 12 or even 24 cores completed properly.

       

      I am using the attached application to test how things work (a ring of nodes and a token sent over the ring). I have run tests of up to 1 billion token hops on up to 8 cores. With more than 8 (or 12) cores it stops. For example, today on 4 cores:

       

      rck00: [0] Started. The token will be resubmitted 100000000 times!
      rck00: [0] Token here: 1003
      rck00: [0] Token here: 2007
      rck00: [0] Token here: 4015
      rck00: [0] Token here: 8031
      rck00: [0] Token here: 16063
      rck00: [0] Token here: 32127
      rck00: [0] Token here: 64255
      rck00: [0] Token here: 128511
      rck00: [0] Token here: 257023
      rck00: [0] Token here: 514047
      rck00: [0] Token here: 1028095
      rck00: [0] Token here: 2056191
      rck00: [0] Token here: 4112383
      rck00: [0] Token here: 8224767
      rck00: [0] Token here: 16449535
      rck00: [0] Token here: 32899071
      rck00: [0] Token here: 65798143
      rck01: [1] ~~Completed. Token here: 100000000
      

       

      while on 12:

       

      rck00: [0] Started. The token will be resubmitted 100000000 times!
      rck00: [0] Token here: 1007
      rck00: [0] Token here: 2015
      rck00: [0] Token here: 4031
      rck00: [0] Token here: 8063
      rck00: [0] Token here: 16127
      rck00: [0] Token here: 32255
      rck00: [0] Token here: 64511
      

       

      it got stuck at this point.

       

      My analysis suggests that this is a flag synchronization problem.

       

      Did/Does anyone else have the same problem or is it just our SCC?

       

      Thanks,

      Vasileios

        • 1. Re: RCCE Communication gets stuck
          Apotheosis

          I am finding the same thing. Sometimes I can manage to get 12/16 cores working, but afterwards it just locks up, and only sccPowercycle -r (with sccBmc -i) fixes the issue, after which it locks up again.

           

          However, I am running the linux_2.6.38.3 ISO and compiling the programs with the i586 compiler.

           

          I wonder if these issues are related. I shall download your code and see if the issue is similar.

           

          EDIT: Yes, your code locks up here. Unfortunately, I am not entirely sure what the root cause is. This started after we upgraded to sccKit 1.4.0 and also started to use the new Linux image and new compiler.

          • 2. Re: RCCE Communication gets stuck
            tedk

            When you say new Linux image, do you mean the one that came with sccKit 1.4.0 or the new beta image that is on our SVN? That beta image is still very preliminary. Do you see this issue with the default Linux that came with sccKit 1.4.0?

            • 3. Re: RCCE Communication gets stuck
              saibbot

              I use the default image from sccKit 1.4.0. Before 1.4.0 (that is, with 1.3.0), I had problems when using any setting other than setting 0 (533/800/800 MHz) on the SCC. The SCC still "acts" worse with settings other than 0.

              • 4. Re: RCCE Communication gets stuck
                tedk

                Is this a Bugzilla bug also? Do you have a bug number?

                What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?

                • 5. Re: RCCE Communication gets stuck
                  saibbot

                  Ted Kubaska wrote:

                   

                  Is this a Bugzilla bug also? Do you have a bug number?

                  What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?

                  No, I haven't reported it in Bugzilla yet. I wanted to ensure that something is indeed wrong.

                   

                   

                  I am using the RCCE from the trunk and I used icc with

                   

                  PLATFORMFLAGS=$(BMFLAG) -DSCC -DSHMADD -static -mcpu=pentium -gcc-version=340 -I../include.

                   

                   

                  I just tested it with the tag RCCE_V1.0.13. Same behaviour.

                  • 6. Re: RCCE Communication gets stuck
                    tedk

                    Thanks. Does your app actually use shared memory? If not, you could try running without -DSHMADD to see if that shared memory addition is causing trouble. I wouldn't recommend the tagged RCCE; the trunk is best. Meanwhile I'll try running it here on a known good system and see if I see the same problem.

                     

                    I doubt this is a hw issue. But running the app on known good hw is a test for this.

                    • 7. Re: RCCE Communication gets stuck
                      saibbot

                      I tried it without the SHMADD flag. Same behaviour.

                      • 8. Re: RCCE Communication gets stuck
                        tedk

                        Thanks. Are you running on your own hw or are you using a marc system?

                         

                        When you run RCCE with -DSHMADD, LUT values get modified. Those modifications are going to stay there if later you run without -DSHMADD. You're not allocating the expanded shared memory but the LUT changes remain. You have to reset the SCC and reboot Linux to remove the LUT changes. I doubt very much that these LUT changes have anything to do with your issue.

                         

                        Did you say this already? ... sorry if I forgot. You are running 1.4.0. Did you see this problem with 1.3.0?

                        • 9. Re: RCCE Communication gets stuck
                          tedk

                          Oh, marc026 ... sorry I missed that

                          • 10. Re: RCCE Communication gets stuck
                            tedk

                            Hmmm ... this is interesting. I cannot get this to fail on a 1.3.0 system. Ran on 24 cores with 100000000. Would you expect it to fail in the first few invocations? Does it sometimes run and sometimes fail? Would you expect it to fail on 24 cores?

                             

                            tekubasx@marc042:/shared/tekubasx/RING$ rccerun -nue 24 -f rc.hosts ringsync
                            pssh -h PSSH_HOST_FILE.18163 -t -1 -p 24 /shared/tekubasx/RING/mpb.18163 < /dev/null
                            [1] 11:40:06 [SUCCESS] rck13
                            [2] 11:40:06 [SUCCESS] rck06
                            [3] 11:40:06 [SUCCESS] rck11
                            [4] 11:40:06 [SUCCESS] rck23
                            [5] 11:40:06 [SUCCESS] rck00
                            [6] 11:40:06 [SUCCESS] rck03
                            [7] 11:40:06 [SUCCESS] rck09
                            [8] 11:40:06 [SUCCESS] rck10
                            [9] 11:40:06 [SUCCESS] rck14
                            [10] 11:40:06 [SUCCESS] rck08
                            [11] 11:40:06 [SUCCESS] rck17
                            [12] 11:40:06 [SUCCESS] rck02
                            [13] 11:40:06 [SUCCESS] rck05
                            [14] 11:40:06 [SUCCESS] rck22
                            [15] 11:40:06 [SUCCESS] rck07
                            [16] 11:40:06 [SUCCESS] rck12
                            [17] 11:40:06 [SUCCESS] rck16
                            [18] 11:40:06 [SUCCESS] rck18
                            [19] 11:40:06 [SUCCESS] rck19
                            [20] 11:40:06 [SUCCESS] rck20
                            [21] 11:40:06 [SUCCESS] rck21
                            [22] 11:40:06 [SUCCESS] rck01
                            [23] 11:40:06 [SUCCESS] rck04
                            [24] 11:40:06 [SUCCESS] rck15
                            pssh -h PSSH_HOST_FILE.18163 -t -1 -P -p 24 /shared/tekubasx/RING/ringsync 24 0.533 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 < /dev/null
                            rck00: [0] Started. The token will be resubmitted 100000000 times!
                            rck00: [0] Token here: 1007
                            [0] Token here: 2015
                            rck00: [0] Token here: 4031
                            rck00: [0] Token here: 8063
                            rck00: [0] Token here: 16127
                            rck00: [0] Token here: 32255
                            rck00: [0] Token here: 64511
                            rck00: [0] Token here: 129023
                            rck00: [0] Token here: 258047
                            rck00: [0] Token here: 516095
                            rck00: [0] Token here: 1032191
                            rck00: [0] Token here: 2064383
                            rck00: [0] Token here: 4128767
                            rck00: [0] Token here: 8257535
                            rck00: [0] Token here: 16515071
                            rck00: [0] Token here: 33030143
                            rck00: [0] Token here: 66060287
                            rck17: [17] ~~Completed. Token here: 100000000
                            [1] 11:43:01 [SUCCESS] rck01
                            [2] 11:43:01 [SUCCESS] rck00
                            [3] 11:43:01 [SUCCESS] rck02
                            [4] 11:43:01 [SUCCESS] rck03
                            [5] 11:43:01 [SUCCESS] rck04
                            [6] 11:43:01 [SUCCESS] rck05
                            [7] 11:43:01 [SUCCESS] rck06
                            [8] 11:43:01 [SUCCESS] rck07
                            [9] 11:43:01 [SUCCESS] rck08
                            [10] 11:43:01 [SUCCESS] rck09
                            [11] 11:43:01 [SUCCESS] rck10
                            [12] 11:43:01 [SUCCESS] rck11
                            [13] 11:43:01 [SUCCESS] rck12
                            [14] 11:43:01 [SUCCESS] rck13
                            [15] 11:43:01 [SUCCESS] rck14
                            [16] 11:43:01 [SUCCESS] rck15
                            [17] 11:43:01 [SUCCESS] rck16
                            [18] 11:43:01 [SUCCESS] rck17
                            [19] 11:43:01 [SUCCESS] rck18
                            [20] 11:43:01 [SUCCESS] rck19
                            [21] 11:43:01 [SUCCESS] rck20
                            [22] 11:43:01 [SUCCESS] rck21
                            [23] 11:43:01 [SUCCESS] rck22
                            [24] 11:43:01 [SUCCESS] rck23
                            tekubasx@marc042:/shared/tekubasx/RING$

                            • 11. Re: RCCE Communication gets stuck
                              saibbot

                              On 24 cores I saw it running properly only for two or three hours, about 10 days ago. Usually it does not even reach 1000 messages.

                               

                              Now that I have tried a build without SHMADD (after a reset and boot), it ran properly 4 or 5 times but then stopped working, and it now gets stuck at around 30K to 120K messages (I tried resetting it again, but nothing changed).
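To compare where different runs stall, the last token count printed before a hang can be pulled out of a saved log. The following is only a sketch: it assumes the output was captured to a file and that progress lines keep the "Token here: N" form shown earlier in this thread. The helper name `last_token` is made up for illustration and is not part of RCCE or sccKit.

```shell
#!/bin/bash
# Hypothetical helper (not part of RCCE): print the last token count
# reported in a run log read from stdin.
# Prints nothing if the log contains no "Token here:" line.
last_token() {
    awk '/Token here:/ { last = $NF } END { if (last != "") print last }'
}
```

For example, after saving the output with something like `rccerun -nue 12 -f rc.hosts ringsync | tee run.log`, calling `last_token < run.log` prints the final count that was reported. A run that finished also matches, because its "~~Completed. Token here:" line has the same ending.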

                               

                              With sccKit 1.3.0 things were usually running ok, but only on the 533/800/800 MHz setting.

                              • 12. Re: RCCE Communication gets stuck
                                tedk

                                I started a

                                    nohup doit.sh &

                                on marc101 (1.4.0 with Tile533_Mesh800_DDR800) just so that I could see it fail. I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up. These kinds of issues are notoriously hard to debug.

                                 

                                The code looks pretty straightforward.  Do you have any speculation about why it locks up?

                                 

                                doit.sh looks like

                                 

                                #!/bin/bash
                                date
                                for i in {1..10}
                                do
                                        echo "RUN = $i"
                                        rccerun -nue 24 -f rc.hosts ringsync
                                done
                                date
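One limitation of the loop above is that a run that locks up blocks the script at that iteration forever. A variant that gives up on a stalled run could look like the sketch below. It assumes GNU coreutils `timeout` is available on the management console; the function names `run_with_limit` and `soak` and the 600-second limit are illustrative choices, not anything from RCCE or sccKit.

```shell
#!/bin/bash
# Hypothetical helper: run an external command, but give up after LIMIT seconds.
# GNU timeout exits with status 124 when the limit expires; otherwise the
# command's own exit status is passed through.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
}

# Run a command up to RUNS times and stop at the first run that does not
# finish within LIMIT seconds. Returns 1 on a suspected lockup, 0 otherwise.
# Non-timeout failures of the command are not treated as lockups here.
soak() {
    local runs="$1" limit="$2"
    shift 2
    local i status
    for ((i = 1; i <= runs; i++)); do
        echo "RUN = $i"
        run_with_limit "$limit" "$@"
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "RUN $i: no completion within ${limit}s, treating as a lockup"
            return 1
        fi
    done
    return 0
}
```

With these definitions, `soak 10 600 rccerun -nue 24 -f rc.hosts ringsync` would mirror the ten-run loop; 600 seconds is roughly three times the duration of the successful 24-core run shown earlier in the thread. Stopping rccerun on the console may still leave processes running on the cores, so a cleanup or reset may be needed before the next attempt.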

                                 

                                I filed a bug (232) http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=232

                                • 13. Re: RCCE Communication gets stuck
                                  tedk

                                  Well, it ran 10 times without failing on marc101. I believe you are seeing the lockup, but I have not been able to reproduce it yet.

                                  • 14. Re: RCCE Communication gets stuck
                                    saibbot

                                    Ted Kubaska wrote:

                                    I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up.

                                     

                                    No, no. What I am saying is that it has almost never run properly on more than 12 cores since the 1.4.0 update.

                                     

                                    Did you try running it on all 48 cores?

                                     

                                    Ted Kubaska wrote:

                                    Do you have any speculation about why it locks up?

                                     

                                    It seems to be a flag synchronization problem. I will post a more detailed explanation as soon as possible.
