
    RCKMPI collective communication operations

    mcdelorme

      Hello,

       

      I was wondering if anyone else is experiencing deadlocks when using collective MPI functions (such as MPI_Barrier, MPI_Scan, etc.) on a large number of cores on the SCC.  For example, I find that the following sample code reaches deadlock if I run it on >4 cores:

       

       

      int main( int argc, char **argv ) {
          int mpi_rank;
          int mpi_size;
          unsigned int i = 0;
          MPI_Init( &argc, &argv );
          MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank );
          MPI_Comm_size( MPI_COMM_WORLD, &mpi_size );
          for( i = 0; i < 10000; i++ ) {
              if( mpi_rank == 0 ) {
                  printf( "About to hit barrier %u\n", i );
              }
              MPI_Barrier( MPI_COMM_WORLD );
          }
          MPI_Finalize();
          return EXIT_SUCCESS;
      }

       

      It seems as though the more cores I launch it on, the fewer times it is able to pass the barrier.  I have compiled RCKMPI to use the SCCMPB channel and am using the default kernel.  Any feedback would be greatly appreciated.
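
      In case it helps with debugging, an instrumented variant along these lines (just a sketch, with per-rank prints and an fflush so output is not lost when it hangs) should show which ranks actually enter and leave each barrier:

      #include <stdio.h>
      #include <mpi.h>

      int main( int argc, char **argv ) {
          int mpi_rank;
          int mpi_size;
          unsigned int i = 0;
          MPI_Init( &argc, &argv );
          MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank );
          MPI_Comm_size( MPI_COMM_WORLD, &mpi_size );
          for( i = 0; i < 10000; i++ ) {
              /* every rank reports, not just rank 0, so a stuck rank is visible */
              printf( "rank %d of %d: entering barrier %u\n", mpi_rank, mpi_size, i );
              fflush( stdout );
              MPI_Barrier( MPI_COMM_WORLD );
              printf( "rank %d of %d: passed barrier %u\n", mpi_rank, mpi_size, i );
              fflush( stdout );
          }
          MPI_Finalize();
          return 0;
      }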

       

      Thanks!

      Mike

        • 1. Re: RCKMPI collective communication operations
          compres

          Hello Mike,

           

          I have been trying to replicate your issue without success.

           

          I ran your code, after adding some headers and modifying the return statement to "return 0", like this:

           

          #include <stdio.h>
          #include <mpi.h>

          int main( int argc, char **argv ) {
            int mpi_rank;
            int mpi_size;
            unsigned int i = 0;
            MPI_Init( &argc, &argv );
            MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank );
            MPI_Comm_size( MPI_COMM_WORLD, &mpi_size );
            for( i = 0; i < 1000000; i++ ) {
              if( mpi_rank == 0 ) {
                printf( "About to hit barrier %u\n", i );
              }
              MPI_Barrier( MPI_COMM_WORLD );
            }
            MPI_Finalize();
            return 0;
          }
          Notice that I changed the loop from 10,000 to a million.
          I ran mpiexec with 2, 3, 4, 5 ... up to 12 processes, and then with 24, 36 and 48 processes.
          All my tests completed to the millionth iteration.  These tests take a lot of time (~30 min at 48 processes), so I will stop here.
          Can you provide more details? I am not sure what to ask at this point.
          If possible, I would run the test on another SCC system to make sure it's not a hardware problem.
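
          Another thing that might help narrow it down: a barrier built only out of point-to-point calls, like the rough sketch below (this is NOT RCKMPI's actual barrier algorithm, just an illustration), would tell us whether the hang is in the collective implementation or in the underlying SCCMPB channel.

          #include <stdio.h>
          #include <mpi.h>

          /* Simple linear barrier from point-to-point calls: every rank checks
             in with rank 0, then rank 0 releases everyone. */
          static void p2p_barrier( MPI_Comm comm ) {
            int rank, size, r, token = 0;
            MPI_Comm_rank( comm, &rank );
            MPI_Comm_size( comm, &size );
            if( rank == 0 ) {
              for( r = 1; r < size; r++ )  /* wait for all ranks to check in */
                MPI_Recv( &token, 1, MPI_INT, r, 0, comm, MPI_STATUS_IGNORE );
              for( r = 1; r < size; r++ )  /* release all ranks */
                MPI_Send( &token, 1, MPI_INT, r, 1, comm );
            } else {
              MPI_Send( &token, 1, MPI_INT, 0, 0, comm );
              MPI_Recv( &token, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE );
            }
          }

          int main( int argc, char **argv ) {
            int rank;
            unsigned int i;
            MPI_Init( &argc, &argv );
            MPI_Comm_rank( MPI_COMM_WORLD, &rank );
            for( i = 0; i < 10000; i++ ) {
              if( rank == 0 )
                printf( "p2p barrier %u\n", i );
              p2p_barrier( MPI_COMM_WORLD );
            }
            MPI_Finalize();
            return 0;
          }

          If this version also hangs, the problem is probably below the collective layer rather than in MPI_Barrier itself.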
          - Isaías

           

          PS: I ran the 10,000-iteration test on 2 separate SCC systems.

          • 2. Re: RCKMPI collective communication operations
            tedk

            Isaias, did you run this on a 1.4.0 system with eMAC enabled?  I've moved this discussion to Bugzilla bug 214 because of the possibility of a hw failure.  We would need to verify that hw failure before replacing hw on your MARC system.

             

            http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=214