
    MPI job killed: exit status of rank 0: killed by signal 9

    papandya

      Hello All

       

      I was using mpirun; the job is fairly long and is supposed to run for more than 4-5 minutes on 36 cores, but I keep getting the message:

       

      rank 0 in job 22  rck00_60956   caused collective abort of all ranks
        exit status of rank 0: killed by signal 9

       

      I checked and see that a SIGKILL is generated. It is definitely not coming from my application, and I suspect it is due to using too much CPU time. I have run longer jobs a few weeks back, but only recently do I see such an abort. Can someone please guide me on what needs to be done to get the job to run without the system aborting it?

        • 1. Re: MPI job killed: exit status of rank 0: killed by signal 9
          tedk

          There is no limit put on your CPU time. If you are sharing the system with other users, something may have been set up on your side, but Intel is not restricting your CPU time. If you are using one of the systems reserved for Intel, we have set up a reservation system; if you go over the time allotted to you, the next user may kill your job (unless you arrange something with that other user).

           

          I have found that sometimes ssh does time out on me. For long jobs I often use nohup or set up my ssh to send keepalive packets (usually every 10 sec).
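           

          For example (illustrative; adjust the host pattern and the job command to your setup):

              # ~/.ssh/config on the machine you ssh from: keepalive every 10 sec (OpenSSH)
              Host rck*
                  ServerAliveInterval 10

              # or launch the job so it survives a dropped ssh session:
              nohup mpiexec -n 36 ./your_app > run.log 2>&1 &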

          • 2. Re: MPI job killed: exit status of rank 0: killed by signal 9
            wbrozas

            I have run a process on all 48 cores using mpich for over 10 minutes and have had no problems. It is, however, common that when you receive a signal 9, the job was actually killed because another process died due to a signal 11 (SIGSEGV), and the signal 11 might not always show up.

             

            I'm assuming you set up the mpd ring correctly using the newest sccKit, and that your program started and was killed after a few minutes. When you allocate memory, do you check that the allocation does not return a NULL pointer?
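             

            Something like this is what I mean (just a sketch): fail loudly on a bad allocation instead of letting a later NULL dereference turn into a SIGSEGV:

                #include <stdio.h>
                #include <stdlib.h>
                #include <mpi.h>

                /* Sketch: abort the whole job with a clear message when an
                 * allocation fails, instead of dereferencing NULL later on. */
                float *checked_alloc(size_t nelems, int rank)
                {
                    float *p = malloc(nelems * sizeof *p);
                    if (p == NULL) {
                        fprintf(stderr, "rank %d: could not allocate %lu floats\n",
                                rank, (unsigned long)nelems);
                        MPI_Abort(MPI_COMM_WORLD, 1);  /* bring down all ranks cleanly */
                    }
                    return p;
                }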

            • 3. Re: MPI job killed: exit status of rank 0: killed by signal 9
              papandya

              I guess it's probably not being killed by the system, as I too ran a process that ran longer than 10 minutes, but with a smaller number of processors. It must be some other issue; I will look into my application and try to find out if, as suggested, there is a memory allocation problem.

              • 4. Re: MPI job killed: exit status of rank 0: killed by signal 9
                papandya

                I finally figured out that the actual failure occurs when I do mpi_gather at the root. I tried to find a resolution to my problem, and some blogs mention that if you receive the signal 9 abort, moving to MPICH2 version 1.0.5 might help. Does anyone have a suggestion on this?

                • 5. Re: MPI job killed: exit status of rank 0: killed by signal 9
                  tedk

                  This is a question for Isaias

                  • 6. Re: MPI job killed: exit status of rank 0: killed by signal 9
                    compres

                    Hi

                     

                    What wbrozas suggested is what I would also say.

                     

                    You mention that you don't get the error with lower process counts.  How does your application scale in terms of memory given the number of processes?

                     

                    If possible given your project, could you share the specific code in order to reproduce the issue?  How large is the buffer where you perform the gather?
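                     

                    For reference, the receive buffer only needs to exist at the root, but there it must be sized for the contribution of every rank, and the recvcount argument is per rank, not the total. A sketch (sendbuf, sendcount and so on are illustrative names):

                        /* Sketch: gather `sendcount` floats from every rank at rank 0. */
                        int nprocs, rank;
                        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
                        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                        float *recvbuf = NULL;
                        if (rank == 0) {
                            recvbuf = malloc((size_t)nprocs * sendcount * sizeof *recvbuf);
                            if (recvbuf == NULL)
                                MPI_Abort(MPI_COMM_WORLD, 1);  /* root out of memory */
                        }
                        MPI_Gather(sendbuf, sendcount, MPI_FLOAT,
                                   recvbuf, sendcount, MPI_FLOAT,  /* per-rank count here too */
                                   0, MPI_COMM_WORLD);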

                    • 7. Re: MPI job killed: exit status of rank 0: killed by signal 9
                      papandya

                      Hello All

                       

                      The code is essentially doing a matrix multiplication between 2 square matrices of type float, with both matrices being 3000*3000. It's not execution time that creates the problem; it looks like it is the memory being allocated dynamically for the program to run. I had used these values for evaluations on the SCC earlier and it had worked, but now it fails every time I run the 3000*3000 calculation with these parameters. Ultimately I wish to do 3600*3600 and 4900*4900 matrix calculations. What should be done to resolve this? As I want to compute the result for these big matrices, is there any workaround at all for this?
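                       

                      A rough footprint check (my own arithmetic, assuming 4-byte floats and the root holding both inputs plus the result) shows how fast this grows:

                          #include <stdio.h>

                          /* Back-of-the-envelope memory footprint, assuming 4-byte
                           * floats and three full matrices resident at the root. */
                          int main(void)
                          {
                              long dims[] = { 3000, 3600, 4900 };
                              for (int i = 0; i < 3; i++) {
                                  long n = dims[i];
                                  double mb = (double)n * n * sizeof(float) / (1024 * 1024);
                                  printf("%ldx%ld: %.0f MB per matrix, ~%.0f MB for three\n",
                                         n, n, mb, 3 * mb);
                              }
                              return 0;
                          }

                      That comes to roughly 34, 49, and 92 MB per matrix, i.e. about 103, 148, and 275 MB for three, before counting MPI's internal buffers.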

                       

                      regards

                       

                      Parth

                      • 8. Re: MPI job killed: exit status of rank 0: killed by signal 9
                        papandya

                        As a note, I would like to point out that I tried using int instead of float, but it failed in that case too.

                        • 9. Re: MPI job killed: exit status of rank 0: killed by signal 9
                          compres

                          Can you reset and freshly boot your SCC and do a:

                           

                          cat /proc/meminfo

                           

                          before and after running your MPI job, and post both outputs? The root process alone is enough.

                           

                          - Isaías

                          • 10. Re: MPI job killed: exit status of rank 0: killed by signal 9
                            papandya

                            Hey Isaias

                             

                            I am attaching a file with the meminfo output taken on rck00. Please let me know if any other info is required.

                             

                            Parth

                            • 11. Re: MPI job killed: exit status of rank 0: killed by signal 9
                              compres

                              One last thing to make sure: can you do a

                               

                              tail -f /var/log/messages

                               

                              at each core except the root (at the root you can do "tail -100 /var/log/messages" after the job has failed) just before launching your application. Then launch your application and wait until the error occurs. After you get the signal 9, check the output at each core. If you indeed ran out of memory, it should be indicated on at least one of the cores. When a core runs out of memory, the process manager kills the MPI job and you get the error where mpiexec was issued.
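                               

                              If logging into every core by hand is tedious, a loop along these lines, run from the MCPC, can collect the logs after the failure (a sketch; the rckNN naming follows rck00 used earlier in this thread):

                                  # Sketch: grab the last 100 log lines from every core.
                                  for i in $(seq -w 0 47); do
                                      ssh rck$i "tail -100 /var/log/messages" > messages.rck$i.log
                                  done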

                               

                              Some systems have 320 and others 640 MB of RAM per core (16 GB or 32 GB total, respectively); in your case only 320 MB are available. It can be that your system in fact has 640 MB per core but the kernel is configured to use only 320 MB; if that is the case, then changing the kernel can double your available RAM. You will need to ask for this information.
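                               

                              To see what the kernel is currently using (the physical amount you will still need to ask about), it is enough to look at MemTotal:

                                  grep MemTotal /proc/meminfo

                              On a 320 MB configuration this should report something close to 320 MB, minus the kernel's reserved pages.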

                               

                              What happens if you try with smaller data sets?

                               

                              - Isaías

                              • 12. Re: MPI job killed: exit status of rank 0: killed by signal 9
                                papandya

                                Hello Isaias

                                 

                                I got round to collecting the debug messages and it does run out of memory. Here is the dump for it:

                                 

                                Jan 11 02:31:18 (none) user.err kernel: Out of Memory: Kill process 185 (python2.6) score 2123 and children.
                                Jan 11 02:31:18 (none) user.err kernel: Out of memory: Killed process 186 (cannon).
                                Jan 11 02:31:18 (none) user.warn kernel: oom-killer: gfp_mask=0x280d2, order=0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c0129a95>] out_of_memory+0x75/0xa0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c012abb4>] __alloc_pages+0x274/0x2a0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c013c1a8>] shmem_getpage+0x188/0x5e0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c0257f72>] __mutex_lock_slowpath+0xc2/0x2e0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c0257f8b>] __mutex_lock_slowpath+0xdb/0x2e0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c013ca77>] shmem_file_write+0x57/0x2c0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c013cbd5>] shmem_file_write+0x1b5/0x2c0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c014195a>] vfs_write+0x7a/0xf0
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c0141a7d>] sys_write+0x3d/0x70
                                Jan 11 02:31:18 (none) user.warn kernel:  [<c0102569>] syscall_call+0x7/0xb
                                Jan 11 02:31:18 (none) user.info kernel: Mem-info:
                                Jan 11 02:31:18 (none) user.warn kernel: DMA per-cpu:
                                Jan 11 02:31:18 (none) user.warn kernel: cpu 0 hot: high 0, batch 1 used:0
                                Jan 11 02:31:18 (none) user.warn kernel: cpu 0 cold: high 0, batch 1 used:0
                                Jan 11 02:31:18 (none) user.warn kernel: DMA32 per-cpu: empty
                                Jan 11 02:31:18 (none) user.warn kernel: Normal per-cpu:
                                Jan 11 02:31:18 (none) user.warn kernel: cpu 0 hot: high 90, batch 15 used:12
                                Jan 11 02:31:18 (none) user.warn kernel: cpu 0 cold: high 30, batch 7 used:0
                                Jan 11 02:31:18 (none) user.warn kernel: HighMem per-cpu: empty
                                Jan 11 02:31:18 (none) user.warn kernel: Free pages:        3492kB (0kB HighMem)
                                Jan 11 02:31:18 (none) user.warn kernel: Active:72232 inactive:5893 dirty:0 writeback:0 unstable:0 free:873 slab:1226 mapped:64138 pagetables:106
                                Jan 11 02:31:18 (none) user.warn kernel: DMA free:1328kB min:112kB low:140kB high:168kB active:12948kB inactive:0kB present:16384kB pages_scanned:15042 all_unreclaimable? no
                                Jan 11 02:31:18 (none) user.warn kernel: lowmem_reserve[]: 0 0 304 304
                                Jan 11 02:31:18 (none) user.warn kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
                                Jan 11 02:31:18 (none) user.warn kernel: lowmem_reserve[]: 0 0 304 304
                                Jan 11 02:31:18 (none) user.warn kernel: Normal free:2164kB min:2172kB low:2712kB high:3256kB active:275980kB inactive:23572kB present:311296kB pages_scanned:309296 all_unreclaimable? yes
                                Jan 11 02:31:18 (none) user.warn kernel: lowmem_reserve[]: 0 0 0 0
                                Jan 11 02:31:18 (none) user.warn kernel: HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
                                Jan 11 02:31:18 (none) user.warn kernel: lowmem_reserve[]: 0 0 0 0
                                Jan 11 02:31:18 (none) user.warn kernel: DMA: 0*4kB 0*8kB 1*16kB 1*32kB 0*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1328kB
                                Jan 11 02:31:18 (none) user.warn kernel: DMA32: empty
                                Jan 11 02:31:18 (none) user.warn kernel: Normal: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2164kB
                                Jan 11 02:31:18 (none) user.warn kernel: HighMem: empty
                                Jan 11 02:31:18 (none) user.warn kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
                                Jan 11 02:31:18 (none) user.warn kernel: Free swap  = 0kB
                                Jan 11 02:31:18 (none) user.warn kernel: Total swap = 0kB
                                Jan 11 02:31:18 (none) user.info kernel: Free swap:            0kB
                                Jan 11 02:31:18 (none) user.info kernel: 81920 pages of RAM
                                Jan 11 02:31:18 (none) user.info kernel: 0 pages of HIGHMEM
                                Jan 11 02:31:18 (none) user.info kernel: 1338 reserved pages
                                Jan 11 02:31:18 (none) user.info kernel: 1091 pages shared
                                Jan 11 02:31:18 (none) user.info kernel: 0 pages swap cached
                                Jan 11 02:31:18 (none) user.info kernel: 0 pages dirty
                                Jan 11 02:31:18 (none) user.info kernel: 0 pages writeback
                                Jan 11 02:31:18 (none) user.info kernel: 64138 pages mapped
                                Jan 11 02:31:18 (none) user.info kernel: 1226 pages slab
                                Jan 11 02:31:18 (none) user.info kernel: 106 pages pagetables

                                 

                                With 2400*2400 matrices it works fine, but the next step I take, 3000*3000, becomes a problem. Whom should I ask about whether the SCC system has 640 MB per core or not?

                                 

                                Parth

                                • 13. Re: MPI job killed: exit status of rank 0: killed by signal 9
                                  compres

                                  I would imagine you have a contact person at the datacenter.  Perhaps Ted can help you with this.

                                   

                                  - Isaías

                                  • 14. Re: MPI job killed: exit status of rank 0: killed by signal 9
                                    papandya

                                    Hello Ted

                                     

                                    Is it possible to assign more RAM to the marc037 system so that I can complete the set of programs I want to run? I am getting an out-of-memory error and would like to know if something can be done about it.

                                     

                                    regards

                                     

                                    Parth
