For a long time, my colleagues and I, working on the marc026 SCC, have had the following problems:
[...] RCCE_barrier) are not always passed.
[...] RCCE and iRCCE libraries).
These problems appear when trying to run a program on more than 8 cores. At times, runs on 12 or even 24 cores completed properly.
I am using the attached application to test how things work (a ring of nodes with a token sent around the ring). I have run tests of up to 1 billion token hops on up to 8 cores. With more than 8 (or 12) cores, it stops. For example, today on 4 cores:
rck00: [0] Started. The token will be resubmitted 100000000 times!
rck00: [0] Token here: 1003
rck00: [0] Token here: 2007
rck00: [0] Token here: 4015
rck00: [0] Token here: 8031
rck00: [0] Token here: 16063
rck00: [0] Token here: 32127
rck00: [0] Token here: 64255
rck00: [0] Token here: 128511
rck00: [0] Token here: 257023
rck00: [0] Token here: 514047
rck00: [0] Token here: 1028095
rck00: [0] Token here: 2056191
rck00: [0] Token here: 4112383
rck00: [0] Token here: 8224767
rck00: [0] Token here: 16449535
rck00: [0] Token here: 32899071
rck00: [0] Token here: 65798143
rck01: [1] ~~Completed. Token here: 100000000
while on 12:
rck00: [0] Started. The token will be resubmitted 100000000 times!
rck00: [0] Token here: 1007
rck00: [0] Token here: 2015
rck00: [0] Token here: 4031
rck00: [0] Token here: 8063
rck00: [0] Token here: 16127
rck00: [0] Token here: 32255
rck00: [0] Token here: 64511
it got stuck at this point.
My analysis suggests that it is a flag synchronization problem.
Did/Does anyone else have the same problem or is it just our SCC?
Thanks,
Vasileios
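The two logs quoted above differ in one visible way: the 4-core run ends with a "~~Completed" line, while the 12-core run stops after a "Token here" line. A minimal sketch of a check for that, assuming the output was captured to a file and that the log lines look like the ones quoted in this thread (the function name check_ring_log is made up for illustration):

```shell
#!/bin/bash
# Sketch: decide from captured output whether a ring run finished or stalled.
# Assumption: the log format matches the lines quoted in this thread.

check_ring_log() {
    local log="$1"
    if grep -q '~~Completed' "$log"; then
        echo "completed"
    else
        # Pick the last progress value that was printed before the run stopped.
        local last
        last=$(grep -o 'Token here: [0-9]*' "$log" | tail -n 1 | awk '{print $3}') || true
        echo "stalled, last token value ${last:-none}"
    fi
}
```

Since a stalled run never returns on its own, the log would have to be inspected from another shell or after the run has been interrupted.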
I am finding the same thing. Sometimes I can get 12 or 16 cores working, but afterwards it just locks up, and only sccPowercycle -r (with sccBmc -i) fixes the issue, after which it locks up again.
However, I am running the linux_2.6.38.3 iso and compiling the programs with the i586 compiler.
I wonder if these issues are related. I shall download your code and see if the issue is similar.
EDIT: Yes, your code locks up here. Unfortunately, I am not entirely sure what the root cause is. This started after we upgraded to sccKit 1.4.0 and also started to use the new Linux image and new compiler.
When you say new Linux image, do you mean the one that came with sccKit 1.4.0 or the new beta image that is on our SVN? That beta image is still very preliminary. Do you see this issue with the default Linux that came with sccKit 1.4.0?
I use the default image from sccKit 1.4.0. Before 1.4.0 (that is, with 1.3.0), I had problems when using any setting other than 0 (533/800/800 MHz) on the SCC. The SCC still behaves worse with settings other than 0.
Is this a Bugzilla bug also? Do you have a bug number?
What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?
Ted Kubaska wrote:
Is this a Bugzilla bug also? Do you have a bug number?
What version of RCCE are you using? The trunk? Did you compile with icc? What were the PLATFORM_FLAGS?
No, I haven't reported it in Bugzilla yet. I wanted to ensure that something is indeed wrong.
I am using the RCCE from the trunk and I used icc with
PLATFORMFLAGS=$(BMFLAG) -DSCC -DSHMADD -static -mcpu=pentium -gcc-version=340 -I../include.
I just tested it with the tag RCCE_V1.0.13. Same behaviour.
Thanks. Does your app actually use shared memory? If not, you could try running without -DSHMADD to see if that shared memory addition is causing trouble. I wouldn't recommend the tagged RCCE; the trunk is best. Meanwhile I'll try running it here on a known good system and see if I see the same problem.
I doubt this is a hw issue. But running the app on known good hw is a test for this.
I tried it without the SHMADD flag. Same behaviour.
Thanks. Are you running on your own hw or are you using a marc system?
When you run RCCE with -DSHMADD, LUT values get modified. Those modifications are going to stay there if later you run without -DSHMADD. You're not allocating the expanded shared memory but the LUT changes remain. You have to reset the SCC and reboot Linux to remove the LUT changes. I doubt very much that these LUT changes have anything to do with your issue.
Did you say this already? ... sorry if I forgot. You are running 1.4.0. Did you see this problem with 1.3.0?
Oh, marc026 ... sorry I missed that
Hmmm ... this is interesting. I cannot get this to fail on a 1.3.0 system. Ran on 24 cores with 100000000. Would you expect it to fail in the first few invocations? Does it sometimes run and sometimes fail? Would you expect it to fail on 24 cores?
tekubasx@marc042:/shared/tekubasx/RING$ rccerun -nue 24 -f rc.hosts ringsync
pssh -h PSSH_HOST_FILE.18163 -t -1 -p 24 /shared/tekubasx/RING/mpb.18163 < /dev/null
[1] 11:40:06 [SUCCESS] rck13
[2] 11:40:06 [SUCCESS] rck06
[3] 11:40:06 [SUCCESS] rck11
[4] 11:40:06 [SUCCESS] rck23
[5] 11:40:06 [SUCCESS] rck00
[6] 11:40:06 [SUCCESS] rck03
[7] 11:40:06 [SUCCESS] rck09
[8] 11:40:06 [SUCCESS] rck10
[9] 11:40:06 [SUCCESS] rck14
[10] 11:40:06 [SUCCESS] rck08
[11] 11:40:06 [SUCCESS] rck17
[12] 11:40:06 [SUCCESS] rck02
[13] 11:40:06 [SUCCESS] rck05
[14] 11:40:06 [SUCCESS] rck22
[15] 11:40:06 [SUCCESS] rck07
[16] 11:40:06 [SUCCESS] rck12
[17] 11:40:06 [SUCCESS] rck16
[18] 11:40:06 [SUCCESS] rck18
[19] 11:40:06 [SUCCESS] rck19
[20] 11:40:06 [SUCCESS] rck20
[21] 11:40:06 [SUCCESS] rck21
[22] 11:40:06 [SUCCESS] rck01
[23] 11:40:06 [SUCCESS] rck04
[24] 11:40:06 [SUCCESS] rck15
pssh -h PSSH_HOST_FILE.18163 -t -1 -P -p 24 /shared/tekubasx/RING/ringsync 24 0.533 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 < /dev/null
rck00: [0] Started. The token will be resubmitted 100000000 times!
rck00: [0] Token here: 1007
[0] Token here: 2015
rck00: [0] Token here: 4031
rck00: [0] Token here: 8063
rck00: [0] Token here: 16127
rck00: [0] Token here: 32255
rck00: [0] Token here: 64511
rck00: [0] Token here: 129023
rck00: [0] Token here: 258047
rck00: [0] Token here: 516095
rck00: [0] Token here: 1032191
rck00: [0] Token here: 2064383
rck00: [0] Token here: 4128767
rck00: [0] Token here: 8257535
rck00: [0] Token here: 16515071
rck00: [0] Token here: 33030143
rck00: [0] Token here: 66060287
rck17: [17] ~~Completed. Token here: 100000000
[1] 11:43:01 [SUCCESS] rck01
[2] 11:43:01 [SUCCESS] rck00
[3] 11:43:01 [SUCCESS] rck02
[4] 11:43:01 [SUCCESS] rck03
[5] 11:43:01 [SUCCESS] rck04
[6] 11:43:01 [SUCCESS] rck05
[7] 11:43:01 [SUCCESS] rck06
[8] 11:43:01 [SUCCESS] rck07
[9] 11:43:01 [SUCCESS] rck08
[10] 11:43:01 [SUCCESS] rck09
[11] 11:43:01 [SUCCESS] rck10
[12] 11:43:01 [SUCCESS] rck11
[13] 11:43:01 [SUCCESS] rck12
[14] 11:43:01 [SUCCESS] rck13
[15] 11:43:01 [SUCCESS] rck14
[16] 11:43:01 [SUCCESS] rck15
[17] 11:43:01 [SUCCESS] rck16
[18] 11:43:01 [SUCCESS] rck17
[19] 11:43:01 [SUCCESS] rck18
[20] 11:43:01 [SUCCESS] rck19
[21] 11:43:01 [SUCCESS] rck20
[22] 11:43:01 [SUCCESS] rck21
[23] 11:43:01 [SUCCESS] rck22
[24] 11:43:01 [SUCCESS] rck23
tekubasx@marc042:/shared/tekubasx/RING$
On 24 cores I saw it running properly only for two or three hours about 10 days ago. Usually it does not even reach 1000 messages.
Now that I tried a build without SHMADD (after a reset and boot), it ran properly 4 or 5 times, but then it stopped working and now gets stuck at around 30-120K messages (I tried resetting it again, but nothing changed).
With sccKit 1.3.0 things were usually running ok, but only on the 533/800/800 MHz setting.
I started a
nohup doit.sh &
on marc101 (1.4.0 with Tile533_Mesh800_DDR800) just so that I could see it fail. I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up. These kinds of issues are notoriously hard to debug.
The code looks pretty straightforward. Do you have any speculation about why it locks up?
doit.sh looks like
#!/bin/bash
date
for i in {1..10}
do
    echo "RUN = $i"
    rccerun -nue 24 -f rc.hosts ringsync
done
date
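One limitation of a plain loop like this is that a run that locks up never returns, so the remaining iterations never start. A possible variant, sketched under the assumption that GNU coreutils `timeout` is installed on the host, puts a time limit on every run and counts the runs that hit it. The function name run_with_watchdog and the 600-second limit are made up for illustration (the successful 24-core run shown earlier took about three minutes):

```shell
#!/bin/bash
# Sketch: run a command repeatedly and treat any run that exceeds a
# time limit as a lock-up.
# Assumptions: GNU coreutils `timeout` is available; the limit and the
# number of runs are illustrative values, not recommendations.

run_with_watchdog() {
    local limit="$1" runs="$2"
    shift 2
    local hung=0 status i
    for ((i = 1; i <= runs; i++)); do
        echo "RUN = $i"
        status=0
        timeout "$limit" "$@" || status=$?
        # timeout exits with status 124 when it had to kill the command.
        if [ "$status" -eq 124 ]; then
            echo "RUN $i exceeded ${limit}s, treating it as a lock-up"
            hung=$((hung + 1))
        fi
    done
    echo "hung runs: $hung of $runs"
}

# Example call (hypothetical 600 s limit):
# run_with_watchdog 600 10 rccerun -nue 24 -f rc.hosts ringsync
```

Note that `timeout` only stops the command started on the host. Whether processes left on the cores need a separate cleanup or a reset afterwards is not something this sketch handles.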
I filed a bug (232) http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=232
Well, it ran 10 times without failing on marc101. I believe you are seeing the lockup but have not been able to reproduce it yet.
Ted Kubaska wrote:
I guess what you're saying is that it starts working OK and then after repeatedly running the app, it locks up.
No, no. What I am saying is that it has almost never run properly on more than 12 cores since the 1.4.0 update.
Did you try running it on all 48 cores?
Ted Kubaska wrote:
Do you have any speculation about why it locks up?
It seems to be a flag synchronization problem. I will post a more detailed explanation as soon as possible.

