5 Replies Latest reply on Jan 2, 2018 8:29 PM by Intel Corporation

    Program runs well in IPoIB mode, but fails in DAPL mode when the data size exceeds a certain amount

    shanghua

      Hello,

            We successfully installed the InfiniBand network and get reasonable speeds between the servers. We also installed Intel MPI. To provide a DAPL transport, we installed DAPL-ND, software provided by OFED, as the DAPL provider. We run the Intel MPI program with the command “mpiexec -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm:dapl -genv I_MPI_DAPL_PROVIDER=ND0 -n 2 -ppn 1 -hosts 11.4.12.11,11.4.12.12 MPIPassing_3.exe”.

            However, we face a serious problem when we use DAPL mode. We wrote a test program to measure the Intel MPI speed over DAPL. One process uses the standard send (MPI_Send) to send the same picture data again and again, and another process uses the standard receive (MPI_Recv) to receive each picture in turn. The sending process runs on one server and the receiving process runs on another server.

            When we run the program, we set the picture number, which is the number of times the same picture is sent. When we set the picture number to 50 or more (picture number>=50) with a picture size of 314MB, the program raises the error below.

           dapls_ib_mr_register() NDRegister:  (Dat_Status type 0x40000 subtype 0)   @ line 1685 flags 0x7 len 329117696 vaddr 000002457531A000

            At the same time, the connection to the receiving process on the receiving server dropped. We have no idea why this error happened or why we lost the connection. The strangest thing is that the program runs normally with the same picture size and the same picture number in IPoIB mode, using the command “mpiexec -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm:tcp -genv I_MPI_TCP_NETMASK=ib -n 2 -ppn 1 -hosts 11.4.12.11,11.4.12.12 MPIPassing_3.exe”. Even when we set the picture number to 100, it still works well. Because the program runs well in IPoIB mode, we conclude that our code is correct and has no memory leak.

           However, whenever we run the program in DAPL mode with a picture number of 50 or more and the 314MB picture size, the error mentioned above appears. In DAPL mode with the 314MB picture size and picture number<50, it runs successfully. Another observation is that with a 43MB picture size and picture number=100 in DAPL mode, it also runs well.
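           To put these cases side by side, here is a rough calculation of the total data volume in each run. It is only a sketch: it assumes each large picture is exactly the 329117696 bytes reported as len in the error above, it treats 43MB as 43 x 1024 x 1024 bytes, and the names totalBytes, kLargePicture and kSmallPicture are ours for illustration, not part of the test program.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper: total payload moved in one run.
constexpr std::uint64_t totalBytes(std::uint64_t bytesPerPicture,
                                   std::uint64_t pictureCount) {
    return bytesPerPicture * pictureCount;
}

// Assumption: the "len" value from the dapls_ib_mr_register() error
// is the size of one 314MB picture.
constexpr std::uint64_t kLargePicture = 329117696ULL;
// Assumption: 43MB means 43 * 1024 * 1024 bytes.
constexpr std::uint64_t kSmallPicture = 43ULL * 1024 * 1024;

// Prints the total volume for the three DAPL cases described above.
void printVolumes() {
    std::cout << "314MB x 50  (fails): " << totalBytes(kLargePicture, 50) << " bytes\n";
    std::cout << "314MB x 49  (works): " << totalBytes(kLargePicture, 49) << " bytes\n";
    std::cout << "43MB  x 100 (works): " << totalBytes(kSmallPicture, 100) << " bytes\n";
}
```

           Under these assumptions the failing case moves 16455884800 bytes (about 15.3 GiB) in total, while the 43MB case with 100 pictures moves about 4.2 GiB. This only compares volumes; it does not tell us whether the failure depends on the total volume, on the number of buffers, or on something else.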

           So we wonder whether there is a limit on the size of the data that can be sent. Does the amount of RAM affect this limit? Or did we do something wrong without knowing it? Why, with picture size>314MB and picture number>50, does the sending server lose its connection to the receiving server?

           Finally, one important detail: we run our Intel MPI commands on the sending server.

           Can someone help us? Any advice would help. We look forward to your reply. The critical part of our code is shown below.

       

      Our code for the Send Node process is like this:

      MPI_Barrier(MPI_COMM_WORLD);
      for (int i = 0; i < ComTimes; i++)
      {
           // our sendMessage() is equivalent to MPI_Send()
           pProcessNode->sendMessage(pPic->PicData, Buf_Size, DataType, RECV_NODE, CommTag);
           CommTag--;
      }

       

      And the code for the Recv Node process is like this:

      MPI_Barrier(MPI_COMM_WORLD);
      for (int i = 0; i < ComTimes; i++)
      {
           // a new receive buffer is allocated for every picture
           uchar *pRecvData = new uchar[Buf_Size];
           pProcessNode->recvMessage(pRecvData, Buf_Size, DataType, SEND_NODE, CommTag, RecvMode);
           CommTag--;
           if (i == 0)
           {
                // timing starts once the first picture has been received
                Start = MPI_Wtime();
           }
           delete[] pRecvData; // array form, to match new[]
      }
      End = MPI_Wtime();
      TimeTotal = End - Start;
      std::cout << "Standard total time is:" << TimeTotal << std::endl;
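      Since the receive loop allocates a new buffer for every picture, one variation we could test is to allocate the buffer once and reuse it. The sketch below is not a confirmed fix; it is only meant to help tell whether the failure depends on how many distinct buffers are used. RecvFn and receiveAll are hypothetical names, and RecvFn stands in for our recvMessage() wrapper so that the sketch compiles without MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using uchar = unsigned char;

// Hypothetical stand-in for pProcessNode->recvMessage(...):
// receives one picture of `len` bytes into `buf` using message tag `tag`.
using RecvFn = std::function<void(uchar* buf, std::size_t len, int tag)>;

// Receives `comTimes` pictures into a single reused buffer.
// Returns the tag value after the last decrement, mirroring CommTag--.
int receiveAll(const RecvFn& recv, std::size_t bufSize, int comTimes, int firstTag) {
    assert(comTimes >= 0);
    std::vector<uchar> buffer(bufSize);  // allocated once, freed automatically
    int tag = firstTag;
    for (int i = 0; i < comTimes; ++i) {
        recv(buffer.data(), buffer.size(), tag);  // same address every iteration
        --tag;
    }
    return tag;
}
```

      In our program the callable would wrap pProcessNode->recvMessage() with the same DataType, SEND_NODE and RecvMode arguments as in the loop above.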