28 Replies Latest reply on Jul 8, 2011 9:01 AM by tedk

    16 is too much

    vmaffione

      Hi all,

        Running a big program, I keep getting a weird error [ERROR 139], which is supposed to be a segmentation fault, but I don't understand how that is possible in my case. The problem is that my program (a genetic algorithm that can run on N cores) works perfectly if the number of cores is strictly less than 16, but it immediately returns error 139 otherwise.

       

      The program contains no constants related to memory allocation or array indexes. So why 16? Is it possible that the error is due to the fact that the executable is big (3.17 MB)? I don't know what to think about it. I tried porting the program to sockets (replacing RCCE_send and RCCE_recv with socket functions), and it worked perfectly.

       

      How can I debug it on SCC?

       

      I'm working on marc026.

       

      Thank you so much,

        Vincenzo

        • 1. Re: 16 is too much
          aprell

          Hi Vincenzo,

           

          What do you mean when you say the program segfaults immediately? Is it happening in RCCE_init?

          • 2. Re: 16 is too much
            tedk

            How are you running your program? Are you using rccerun or pssh? If you login directly to a core and invoke your program directly on the core, you can get a better error message. That's how we see that 139 is a seg fault.
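The number 139 follows the usual shell convention: a process killed by signal N is reported with exit status 128 + N, and SIGSEGV is signal 11 on Linux. A minimal C sketch of that decoding follows; the function names are made up for illustration and are not part of RCCE or rccerun.

```c
#include <signal.h>
#include <stddef.h>
#include <stdio.h>

/* Shell convention: a process killed by signal N shows up as status 128 + N.
   Returns N, or 0 if the status does not look like a signal death. */
int signal_from_exit_status(int status)
{
    if (status > 128 && status < 256)
        return status - 128;
    return 0;
}

/* Writes a short human-readable description of the status into buf. */
void describe_exit_status(int status, char *buf, size_t len)
{
    int sig = signal_from_exit_status(status);

    if (sig != 0)
        snprintf(buf, len, "killed by signal %d (%s)", sig,
                 sig == SIGSEGV ? "SIGSEGV" : "other signal");
    else
        snprintf(buf, len, "exited with status %d", status);
}
```

Calling describe_exit_status(139, buf, sizeof buf) fills buf with a message naming SIGSEGV, which matches what you see when running directly on a core.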

             

            You can either run your program on just one core or broadcast the invocation to selected cores. When I want to broadcast, I bring up sccKonsole on the cores I am using, configure one core to broadcast input to the other cores, and then run on that core.

             

            We've also been able to "sort of" use gdb on one core. It doesn't work 100% but has worked well enough to solve some issues.

            • 3. Re: 16 is too much
              tedk

              BTW, if you use gdb on a single core ... it's best to compile your program with gcc and not icc. With an icc-compiled program, we couldn't set a breakpoint on a line number (even though we could display line numbers); we could break at a function; but then we seg faulted when we tried to step.

              • 4. Re: 16 is too much
                vmaffione

                Hi!

                  Thank you for your reply.

                I'm sorry for replying only now, but I put the SCC aside for a while!

                 

                Actually, I don't know how to tell when the segmentation fault happens.

                 

                I tried to handle the SIGSEGV signal when running my program with 16 cores.

                In this case (and only in this case) the error doesn't show up, but something weird happens (and I don't understand why):

                my program gets stuck on all cores.

                It seems that my signal handler, which prints something and then calls exit(0) isn't executed (otherwise the program should terminate, shouldn't it?).

                However, if I don't install a signal handler for SIGSEGV (or ignore SIGSEGV), error 139 is returned.
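One thing that may be worth ruling out (an assumption, not a diagnosis of this particular hang): printf() and exit() are not async-signal-safe, so a SIGSEGV handler that calls them can itself block. A minimal sketch of a handler restricted to write() and _exit() is shown below. The names format_hex, segv_handler and install_segv_handler are hypothetical, and nothing here is RCCE-specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Formats v as "0x..." into buf without stdio; returns the number of chars. */
size_t format_hex(uintptr_t v, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof v];
    size_t n = 0, len = 0;

    do {
        tmp[n++] = digits[v & 0xf];
        v >>= 4;
    } while (v != 0);

    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0)
        buf[len++] = tmp[--n];
    buf[len] = '\0';
    return len;
}

/* write() is async-signal-safe; the result is deliberately ignored. */
static void emit(const char *s, size_t n)
{
    ssize_t r = write(STDERR_FILENO, s, n);
    (void)r;
}

/* Prints the faulting address and terminates without running atexit hooks. */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    static const char msg[] = "SIGSEGV at address ";
    char addr[2 * sizeof(uintptr_t) + 3];
    size_t len = format_hex((uintptr_t)info->si_addr, addr);

    (void)sig;
    (void)ctx;
    emit(msg, sizeof msg - 1);
    emit(addr, len);
    emit("\n", 1);
    _exit(1);
}

/* Returns 0 on success, -1 on failure (same as sigaction). */
int install_segv_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_segv_handler() once at the start of main() is enough. The printed si_addr can then be compared with the addresses of the flags being accessed.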

                 

                I don't know how to find out whether something goes wrong in RCCE_init() or later (debug prints are useless).

                 

                Ted was talking about gdb, but how do I use it on a core?

                 

                 

                Thanks,

                  Vincenzo

                • 5. Re: 16 is too much
                  vmaffione

                  Hi Ted

                    thank you for your reply.

                   

                  I found out where the error is, and it turned out to be something extremely weird, that I can't understand!

                   

                  First of all, I use rccerun, nongory, no power management, and no single-bit flags. I compile with g++ (my application is in C++).

                   

                  I found the exact instructions (there are actually two of them) that cause a segmentation fault by using a log file on each core. Each log file is maintained on the corresponding core's local filesystem. Executing a flush() after each write makes sure that the write is actually done.
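The logging approach described above can be sketched in plain C as follows (the original application is C++; this is only an illustration). The file naming scheme and the functions trace_open and trace_line are hypothetical. Note that fflush() hands the data to the kernel, which is enough to survive a crash of the process itself.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Opens (truncating) "<dir>/trace_<rank>.txt". The naming scheme is made up
   for this sketch. Returns NULL on failure. */
FILE *trace_open(const char *dir, int rank)
{
    char path[256];
    int n = snprintf(path, sizeof path, "%s/trace_%d.txt", dir, rank);

    if (n < 0 || (size_t)n >= sizeof path)
        return NULL;
    return fopen(path, "w");
}

/* Appends one formatted line and flushes immediately, so the line is not
   lost in a user-space buffer if the very next statement segfaults. */
void trace_line(FILE *log, const char *fmt, ...)
{
    va_list ap;

    if (log == NULL)
        return;
    va_start(ap, fmt);
    vfprintf(log, fmt, ap);
    va_end(ap);
    fputc('\n', log);
    fflush(log);
}
```

Placing a trace_line() call immediately before and after each suspect statement narrows the fault down to the last line that made it into the file.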

                   

                  Even though I used a log file for each core (stored locally, as I said before), only two cores get a segmentation fault. The two failing cores are always the same, and their ranks are 14 and 15. In my opinion this is not a coincidence: they are the last two cores, so their internal flags are stored at the edges of the flag array. Why am I saying this? Let me explain below.

                   

                  I found out that the error occurs while accessing an internal synchronization flag (with the nongory interface it can't be my fault!) during a message transfer from core 15 to core 14 (by the way, this happens during the first message transfer).

                  More precisely, the "sent" flag stored in the MPB of core 14 for communication with core 15 (which is accessed both by the RCCE_send() called by core 15 and by the RCCE_recv() called by core 14) causes a segmentation fault: core 15 fails while attempting a write (RCCE_flag_write()) and core 14 fails while attempting a read (RCCE_wait_until()).

                   

                  By the way, the message size is smaller than the chunk size, so the loops in RCCE_send() and RCCE_recv() are not entered in this case.

                   

                  To be extremely precise, the two failing instructions are:

                   

                  *target = *source;   // in RCCE_put_char (nongory nosinglebitflag version) for the core 15

                   

                  while ((RCCE_FLAG_STATUS)(*flaga) != val);   // in RCCE_wait_until for the core 14

                   

                   

                   

                  How is it possible? It seems to be a bug!

                  It can't be my fault, because the invalid memory access involves a synchronization flag, which is an internal feature of the library. My code is not involved in the error (moreover, as I said in my previous message, I ported my application to standard sockets and it works perfectly whatever number of cores I specify).

                   

                   

                  If you want to try all of this, log on to marc026 and go to /shared/vmaffione/apps/ga. Then execute "compila.sh", which also compiles a copy of RCCE modified with log writes.

                  Then type "rccerun -nue 16 -f h n". Cores rck38 (rank 14) and rck39 (rank 15) will fail, while the other cores will get stuck (although they are still working properly).

                  On rck38 and rck39 you will find the log file "ehi.txt". The last log entries show the exact point of failure, as I explained above.

                  If you want to try again, just execute "reset.sh", which clears the MPB of the 16 cores involved (rck24 to rck39) and kills all the stuck "n" processes.

                   

                  Thanks for the attention,

                    Vincenzo

                  • 6. Re: 16 is too much
                    aprell

                    That's interesting... From what you're saying, it might be a good idea to take a closer look at RCCE_flag_alloc. When you run your program with 16 cores, RCCE_init is supposed to fill up a flag line with 32 flags. You could try to include a few checks to see if the flags are really properly set up...
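A rough sketch of the kind of check meant here, in C. It rests on assumptions that are not confirmed anywhere in this thread: two flags per UE (sent and ready), one byte per flag, and 32 flags per flag line. All function names are hypothetical and none of them are RCCE calls. Under those assumptions, 15 UEs need 30 flags and leave slack in the line, while 16 UEs fill the line exactly, so an off-by-one in the allocation would first show up at 16.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: each UE needs two flags (sent, ready) per peer slot. */
size_t flags_needed(size_t num_ues)
{
    return 2 * num_ues;
}

/* Number of flag lines needed to hold `flags` flags (rounds up). */
size_t flag_lines_needed(size_t flags, size_t flags_per_line)
{
    return (flags + flags_per_line - 1) / flags_per_line;
}

/* Returns 1 if the len bytes starting at p lie entirely inside the region
   [base, base + size), 0 otherwise. Usable on a flag pointer right after
   allocation and again right before it is read or written. */
int in_region(const volatile void *base, size_t size,
              const volatile void *p, size_t len)
{
    uintptr_t b = (uintptr_t)base;
    uintptr_t q = (uintptr_t)p;

    return q >= b && len <= size && (q - b) <= size - len;
}
```

Logging the result of in_region() for every flag pointer handed out during initialization, with the MPB start address and size as the region, would show whether the flags for ranks 14 and 15 end up outside the mapped buffer.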

                    • 7. Re: 16 is too much
                      jheld

                      I'd recommend that you look at the assembly language output for those lines of code (they will be one or more instructions).  Also disable all optimizations and try it.

                      • 8. Re: 16 is too much
                        tedk

                        Here's how you would use gdb on a single core.

                         

                        Go to http://marcbug.scc-dc.com/svn/repository/tarballs/

                        Download gdb_scc.tar.

                        Open it up in your /shared directory.

                        Log onto the core to run it.

                        Here's an example. The arguments passed to set args are the ones that rccerun gives pssh. In the example, it is one node (1) with a tile frequency of 0.533 (0.533) on node 0 (00).

                         

                        gdb works better on the core with gcc than with icc.

                         

                        tekubasx@marc101:/shared/tekubasx$ ssh root@rck00

                        root@rck00:~> /shared/gdb-install/bin/gdb /shared/tekubasx/hello

                        GNU gdb (GDB) 7.1

                        Copyright (C) 2010 Free Software Foundation, Inc.

                        License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

                        This is free software: you are free to change and redistribute it.

                        There is NO WARRANTY, to the extent permitted by law.  Type "show copying"

                        and "show warranty" for details.

                        This GDB was configured as "i386-unknown-linux-gnu".

                        For bug reporting instructions, please see:

                        <http://www.gnu.org/software/gdb/bugs/>...

                        Reading symbols from /shared/tekubasx/hello...done.

                        (gdb) set args 1 0.533 00

                        (gdb) r

                        Starting program: /shared/tekubasx/hello 1 0.533 00

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                        Hello from RCCE ... I am 1.0.13.x

                         

                        Program exited normally.

                        (gdb) quit

                        root@rck00:~> exit

                        Connection to rck00 closed.

                        tekubasx@marc101:/shared/tekubasx$

                        • 9. Re: 16 is too much
                          tedk

                          Here's a text file that discusses a problem with gdb and icc on the cores.

                          • 10. Re: 16 is too much
                            vmaffione

                            Thank you, Ted. I'll do it as soon as possible.

                             

                              Vincenzo

                            • 11. Re: 16 is too much
                              vmaffione

                              Right!

                                In fact, it could be that the memory for the particular flag that fails isn't properly allocated. I'll check it.

                               

                              By the way, don't you think that this kind of bug (if I'm not mistaken, of course) is extremely serious? Basically, RCCE would be unreliable.

                               

                               

                              Thanks,

                                Vincenzo

                              • 12. Re: 16 is too much
                                saibbot

                                Did you try reproducing the problem in a smaller example?

                                • 13. Re: 16 is too much
                                  vmaffione

                                  Not yet.

                                  I've only just found the error. Whatever the case, I am going to do it.

                                   

                                  Thanks,

                                    Vincenzo

                                  • 14. Re: 16 is too much
                                    tedk

                                    No one is discounting the seriousness of this bug. What I hope to do on this end is verify that I can reproduce it and then file a Bugzilla bug. I think it would be a P1/critical if not a P1/blocker. You can file the bug yourself if you like. When forum discussions get very nitty-gritty, discussing actual code, we usually move them to Bugzilla. This allows us to prioritize and track and get more internal eyes on the issue.

                                     

                                    If you have a proposed fix to the RCCE code, that would be great. We could attach the fix to the bug and run some tests.
