16 Replies. Latest reply: May 30, 2011 10:38 AM by Vasileios Trigonakis

RCCE Communication gets stuck

Vasileios Trigonakis Community Member

For a long time, my colleagues and I, working on the marc026 SCC, have been having the following problems:

 

  • The barriers (RCCE_barrier) are not always passed.
  • The communication may get stuck (both with RCCE and iRCCE libraries).

 

These problems appear when trying to run a program on more than 8 cores. There were times when runs on 12 or even 24 cores completed properly.

 

I am using the attached application to test how things work (a ring of nodes and a token sent over the ring). I have run tests of up to 1 billion token hops on up to 8 cores. With more than 8 (or 12) cores, it stops. For example, today on 4 cores:

 

rck00: [0] Started. The token will be resubmitted 100000000 times!
rck00: [0] Token here: 1003
rck00: [0] Token here: 2007
rck00: [0] Token here: 4015
rck00: [0] Token here: 8031
rck00: [0] Token here: 16063
rck00: [0] Token here: 32127
rck00: [0] Token here: 64255
rck00: [0] Token here: 128511
rck00: [0] Token here: 257023
rck00: [0] Token here: 514047
rck00: [0] Token here: 1028095
rck00: [0] Token here: 2056191
rck00: [0] Token here: 4112383
rck00: [0] Token here: 8224767
rck00: [0] Token here: 16449535
rck00: [0] Token here: 32899071
rck00: [0] Token here: 65798143
rck01: [1] ~~Completed. Token here: 100000000

 

while on 12:

 

rck00: [0] Started. The token will be resubmitted 100000000 times!
rck00: [0] Token here: 1007
rck00: [0] Token here: 2015
rck00: [0] Token here: 4031
rck00: [0] Token here: 8063
rck00: [0] Token here: 16127
rck00: [0] Token here: 32255
rck00: [0] Token here: 64511

 

it got stuck at this point.
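
The values in the two logs above are consistent with a simple model of the test program, whose source is not shown in this thread: a token value v is seen on core (v + 1) mod n, and core 0 prints the first value it sees at or above a threshold that starts at 1000 and is then set to twice the last printed value. The Python sketch below is a single-process model under exactly those assumptions. The function names and the threshold rule are inferred from the output, not taken from the attached application, and the sketch does not use RCCE.

```python
def holder(token, num_cores):
    """Core on which a token value is observed (assumed model, see text)."""
    return (token + 1) % num_cores


def core0_reports(num_cores, limit, first_threshold=1000):
    """Token values that core 0 would print before the token reaches limit.

    Assumption: core 0 prints the first token it sees that is >= threshold,
    then sets threshold to twice the printed value.
    """
    reports = []
    threshold = first_threshold
    while True:
        # Smallest token >= threshold that lands on core 0,
        # i.e. with (token + 1) divisible by num_cores.
        token = threshold + (-(threshold + 1)) % num_cores
        if token >= limit:
            break
        reports.append(token)
        threshold = 2 * token
    return reports


if __name__ == "__main__":
    print(core0_reports(4, 100000000))
    print(core0_reports(12, 100000000))
```

If the model holds, the 12-core run that stopped after 64511 would have printed 129023 next, so the hang falls somewhere between those two token values.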

 

My analysis suggests that this is a flag synchronization problem.
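
To illustrate what a flag synchronization problem can look like in general, here is a minimal Python sketch of a two-flag handshake between one sender and one receiver, using threads in place of cores. It is a generic illustration, not RCCE's or iRCCE's actual implementation: the `Link` class, its `sent` and `ready` flags, and the timeouts are all hypothetical. Each side waits on a flag that only the other side sets, so a single missed flag update leaves both sides waiting, which from the outside looks like a hang.

```python
import threading


class Link:
    """One-directional channel guarded by two flags (illustrative only)."""

    def __init__(self):
        self.sent = threading.Event()   # set by sender: data is available
        self.ready = threading.Event()  # set by receiver: data was consumed
        self.buffer = None

    def send(self, value, timeout=1.0):
        self.buffer = value
        self.sent.set()
        # A real spin-wait has no timeout; one is used here so that a
        # missing acknowledgement shows up as an error instead of a hang.
        if not self.ready.wait(timeout):
            raise TimeoutError("receiver never acknowledged")
        self.ready.clear()

    def recv(self, timeout=1.0):
        if not self.sent.wait(timeout):
            raise TimeoutError("sender never signalled")
        value = self.buffer
        self.sent.clear()   # clear before acknowledging, so the next
        self.ready.set()    # send cannot have its signal wiped out
        return value
```

With unbounded waits, the `TimeoutError` cases above would instead block forever, which matches the symptom of output simply stopping.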

 

Did/Does anyone else have the same problem or is it just our SCC?

 

Thanks,

Vasileios

  • 1. Re: RCCE Communication gets stuck
    Apotheosis Community Member

    I am finding the same thing. Sometimes I can get 12 or 16 cores working; afterwards, however, it just locks up, and only sccPowercycle -r (with sccBmc -i) fixes the issue, after which it locks up again.

     

    However, I am running the linux_2.6.38.3 ISO and compiling the programs with the i586 compiler.

     

    I wonder if these issues are related. I shall download your code and see if the issue is similar.

     

    EDIT: Yes, your code locks up here. Unfortunately, I am not entirely sure what the root cause is; this started after we upgraded to sccKit 1.4.0 and also started to use the new Linux image and the new compiler.

  • 2. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    When you say new Linux image, do you mean the one that came with sccKit 1.4.0 or the new beta image that is on our SVN? That beta image is still very preliminary. Do you see this issue with the default Linux that came with sccKit 1.4.0?

  • 3. Re: RCCE Communication gets stuck
    Vasileios Trigonakis Community Member

    I use the default image from sccKit 1.4.0. Prior to 1.4.0 (that is, with 1.3.0), I had problems when using any setting other than setting 0 (533/800/800 MHz) on the SCC. The SCC still "acts" worse with settings other than 0.

  • 4. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Is this a Bugzilla bug also? Do you have a bug number?

    What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?

  • 5. Re: RCCE Communication gets stuck
    Vasileios Trigonakis Community Member

    Ted Kubaska wrote:

     

    Is this a Bugzilla bug also? Do you have a bug number?

    What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?

    No, I haven't reported it in Bugzilla yet. I wanted to ensure that something is indeed wrong.

     

     

    I am using the RCCE from the trunk and I used icc with

     

    PLATFORMFLAGS=$(BMFLAG) -DSCC -DSHMADD -static -mcpu=pentium -gcc-version=340 -I../include.

     

     

    I just tested it with the tag RCCE_V1.0.13. Same behaviour.

  • 6. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Thanks. Does your app actually use shared memory? If not, you could try running without -DSHMADD to see if that shared memory addition is causing trouble. I wouldn't recommend the tagged RCCE; the trunk is best. Meanwhile I'll try running it here on a known good system and see if I see the same problem.

     

    I doubt this is a hw issue. But running the app on known good hw is a test for this.

  • 7. Re: RCCE Communication gets stuck
    Vasileios Trigonakis Community Member

    I tried it without the SHMADD flag. Same behaviour.

  • 8. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Thanks. Are you running on your own hw or are you using a marc system?

     

    When you run RCCE with -DSHMADD, LUT values get modified. Those modifications are going to stay there if later you run without -DSHMADD. You're not allocating the expanded shared memory but the LUT changes remain. You have to reset the SCC and reboot Linux to remove the LUT changes. I doubt very much that these LUT changes have anything to do with your issue.

     

    Did you say this already? ... sorry if I forgot. You are running 1.4.0. Did you see this problem with 1.3.0?

  • 9. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Oh, marc026 ... sorry I missed that

  • 10. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Hmmm ... this is interesting. I cannot get this to fail on a 1.3.0 system. I ran it on 24 cores with 100000000. Would you expect it to fail in the first few invocations? Does it sometimes run and sometimes fail? Would you expect it to fail on 24 cores?

     

    tekubasx@marc042:/shared/tekubasx/RING$ rccerun -nue 24 -f rc.hosts ringsync
    pssh -h PSSH_HOST_FILE.18163 -t -1 -p 24 /shared/tekubasx/RING/mpb.18163 < /dev/null
    [1] 11:40:06 [SUCCESS] rck13
    [2] 11:40:06 [SUCCESS] rck06
    [3] 11:40:06 [SUCCESS] rck11
    [4] 11:40:06 [SUCCESS] rck23
    [5] 11:40:06 [SUCCESS] rck00
    [6] 11:40:06 [SUCCESS] rck03
    [7] 11:40:06 [SUCCESS] rck09
    [8] 11:40:06 [SUCCESS] rck10
    [9] 11:40:06 [SUCCESS] rck14
    [10] 11:40:06 [SUCCESS] rck08
    [11] 11:40:06 [SUCCESS] rck17
    [12] 11:40:06 [SUCCESS] rck02
    [13] 11:40:06 [SUCCESS] rck05
    [14] 11:40:06 [SUCCESS] rck22
    [15] 11:40:06 [SUCCESS] rck07
    [16] 11:40:06 [SUCCESS] rck12
    [17] 11:40:06 [SUCCESS] rck16
    [18] 11:40:06 [SUCCESS] rck18
    [19] 11:40:06 [SUCCESS] rck19
    [20] 11:40:06 [SUCCESS] rck20
    [21] 11:40:06 [SUCCESS] rck21
    [22] 11:40:06 [SUCCESS] rck01
    [23] 11:40:06 [SUCCESS] rck04
    [24] 11:40:06 [SUCCESS] rck15
    pssh -h PSSH_HOST_FILE.18163 -t -1 -P -p 24 /shared/tekubasx/RING/ringsync 24 0.533 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 < /dev/null
    rck00: [0] Started. The token will be resubmitted 100000000 times!
    rck00: [0] Token here: 1007
    [0] Token here: 2015
    rck00: [0] Token here: 4031
    rck00: [0] Token here: 8063
    rck00: [0] Token here: 16127
    rck00: [0] Token here: 32255
    rck00: [0] Token here: 64511
    rck00: [0] Token here: 129023
    rck00: [0] Token here: 258047
    rck00: [0] Token here: 516095
    rck00: [0] Token here: 1032191
    rck00: [0] Token here: 2064383
    rck00: [0] Token here: 4128767
    rck00: [0] Token here: 8257535
    rck00: [0] Token here: 16515071
    rck00: [0] Token here: 33030143
    rck00: [0] Token here: 66060287
    rck17: [17] ~~Completed. Token here: 100000000
    [1] 11:43:01 [SUCCESS] rck01
    [2] 11:43:01 [SUCCESS] rck00
    [3] 11:43:01 [SUCCESS] rck02
    [4] 11:43:01 [SUCCESS] rck03
    [5] 11:43:01 [SUCCESS] rck04
    [6] 11:43:01 [SUCCESS] rck05
    [7] 11:43:01 [SUCCESS] rck06
    [8] 11:43:01 [SUCCESS] rck07
    [9] 11:43:01 [SUCCESS] rck08
    [10] 11:43:01 [SUCCESS] rck09
    [11] 11:43:01 [SUCCESS] rck10
    [12] 11:43:01 [SUCCESS] rck11
    [13] 11:43:01 [SUCCESS] rck12
    [14] 11:43:01 [SUCCESS] rck13
    [15] 11:43:01 [SUCCESS] rck14
    [16] 11:43:01 [SUCCESS] rck15
    [17] 11:43:01 [SUCCESS] rck16
    [18] 11:43:01 [SUCCESS] rck17
    [19] 11:43:01 [SUCCESS] rck18
    [20] 11:43:01 [SUCCESS] rck19
    [21] 11:43:01 [SUCCESS] rck20
    [22] 11:43:01 [SUCCESS] rck21
    [23] 11:43:01 [SUCCESS] rck22
    [24] 11:43:01 [SUCCESS] rck23
    tekubasx@marc042:/shared/tekubasx/RING$

  • 11. Re: RCCE Communication gets stuck
    Vasileios Trigonakis Community Member

    On 24 cores, I saw it running properly only for two or three hours about 10 days ago. Usually it does not even reach 1000 messages.

     

    When I tried a build without SHMADD (after a reset and boot), it ran properly 4 or 5 times, but then it stopped working and now gets stuck at around 30K to 120K messages (I tried resetting it again, but nothing changed).

     

    With sccKit 1.3.0 things were usually running ok, but only on the 533/800/800 MHz setting.

  • 12. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    I started a

        nohup doit.sh &

    on marc101 (1.4.0 with Tile533_Mesh800_DDR800) just so that I could see it fail. I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up. These kinds of issues are notoriously hard to debug.

     

    The code looks pretty straightforward.  Do you have any speculation about why it locks up?

     

    doit.sh looks like

     

    #!/bin/bash
    date
    for i in {1..10}
    do
            echo "RUN = $i"
            rccerun -nue 24 -f rc.hosts ringsync
    done
    date

     

    I filed a bug (232) http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=232
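
Because the failure being chased is a hang rather than a crash, a loop like doit.sh would block indefinitely on the first stuck run. The Python sketch below is one possible variant that gives each run a time limit and records the outcome. The function name `run_repeatedly` and the 600-second limit in the usage note are assumptions for illustration; only the rccerun command line is taken from the script above.

```python
import subprocess


def run_repeatedly(cmd, runs, timeout_s):
    """Run cmd `runs` times; return a list of 'ok', 'failed' or 'timeout'."""
    results = []
    for i in range(1, runs + 1):
        try:
            # subprocess.run kills the child process if the timeout expires.
            proc = subprocess.run(cmd, timeout=timeout_s)
            results.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            results.append("timeout")
        print(f"RUN = {i}: {results[-1]}")
    return results
```

It could be called as `run_repeatedly(["rccerun", "-nue", "24", "-f", "rc.hosts", "ringsync"], runs=10, timeout_s=600)`. Note that the timeout only kills the local rccerun process; whether the processes it started on the cores also stop is not guaranteed by this sketch, so a reset may still be needed after a timed-out run.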

  • 13. Re: RCCE Communication gets stuck
    Ted Kubaska Community Member

    Well, it ran 10 times without failing on marc101. I believe you are seeing the lockup, but I have not been able to reproduce it yet.

  • 14. Re: RCCE Communication gets stuck
    Vasileios Trigonakis Community Member

    Ted Kubaska wrote:

    I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up.

     

    No, no. What I am saying is that it almost never ran properly on more than 12 cores after the 1.4.0 update.

     

    Did you try running it on all 48 cores?

     

    Ted Kubaska wrote:

    Do you have any speculation about why it locks up?

     

    It seems to be a flag synchronization problem. I will post a more detailed explanation as soon as possible.
