1 Reply Latest reply on Dec 4, 2017 3:28 PM by Intel Corporation

    Issues with mpich/mpiexec on Dual Intel Xeon E5-2697A v4 System

    bcapasso

      I am attempting to run an application that uses mpich/mpiexec to distribute processes across cores. I have had no trouble running the same application extensively on similar hardware (a dual Intel Xeon E5-2680 v3 system) under the same version of Fedora (23), which suggests to me that the problem is memory- or hardware-related.

       

      I note that the E5-2680 v3 system (which runs the application without issue) does not have TSX-NI, whereas the E5-2697A v4 system does. Could this be the issue? Is it possible to disable TSX-NI on my E5-2697A v4 system to diagnose this? Otherwise, would updating the CPU microcode help?
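
For comparing the two machines, here is a minimal sketch that reports whether the kernel sees the TSX features and which microcode revision is loaded. It assumes a Linux system where `/proc/cpuinfo` lists the TSX-NI features as the `hle` and `rtm` flags and has a `microcode` field; the function name `tsx_and_microcode` is hypothetical and used only for illustration.

```python
def tsx_and_microcode(cpuinfo_text):
    """Parse /proc/cpuinfo-style text.

    Returns (has_hle, has_rtm, microcode_revisions), where the first two
    are booleans for the TSX feature flags and the last is a sorted list
    of the distinct microcode revision strings found.
    """
    flags = set()
    microcode = set()
    for line in cpuinfo_text.splitlines():
        # Each entry looks like "key<tabs>: value"; skip blank separators.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "flags":
            flags.update(value.split())
        elif key == "microcode":
            microcode.add(value.strip())
    return "hle" in flags, "rtm" in flags, sorted(microcode)
```

Running `print(tsx_and_microcode(open("/proc/cpuinfo").read()))` on both systems should show whether they really differ in TSX support and microcode revision. This only reports what the kernel exposes; it does not disable anything.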

       

      Very little debugging information is given when the application fails, but given how quickly it fails after launch, it is clear that something is badly wrong:

       

      [wri@wrimodels12 runs]$ ems_domain --localize midatl

       

        Starting UEMS Program ems_domain (V15.99.8) on wrimodels12 at Sat Dec  2 20:18:28 2017 UTC

       

          *  Localizing "midatl" domain - /home/wri/wrfems/uems/runs/midatl

       

                Primary Domain

                  Projection          : lat-lon

                  Standard Longitude  : -41 Degrees

                  Reference Latitude  : 42 Degrees

                  Reference Longitude : -41 Degrees

                  Grid NX x NY        : 495 x 165

                  Grid Spacing        : 0.170 Degrees

                  Geog Dset Res       : modis_lakes+modis_30s+modis_15s+10m

       

          *  Burn'n up 32 processors to localize your domain. Please ignore the smoke  - Failed (11)

             !  Error running GEOGRID - System Signal Code (SN) : 11 (Invalid Memory Reference - Seg Fault)

       

             While perusing the log/domain_geogrid_stdout.log file I saw the following:

       

               >  YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

               

                Also use the --nogeogrid flag for debugging.

       

      [wri@wrimodels12 static]$ /home/wri/wrfems/uems/util/mpich2/bin/mpiexec -n 32 /home/wri/wrfems/uems/bin/geogrid

       

      ===================================================================================

      =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

      =   PID 47823 RUNNING AT wrimodels12

      =   EXIT CODE: 11

      =   CLEANING UP REMAINING PROCESSES

      =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

      ===================================================================================

      YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

      This typically refers to a problem with your application.

      Please see the FAQ page for debugging suggestions
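
One way to narrow this down is to repeat the manual `mpiexec -n 32` run above with different process counts and see whether the segmentation fault depends on the count. The following is a rough sketch under that assumption; the helper name `run_with_counts` and the choice of counts are hypothetical, and it relies only on the `-n` option already used above. It should be run from the same directory as the manual command.

```python
import subprocess


def run_with_counts(mpiexec, program, counts, runner=subprocess.run):
    """Run `mpiexec -n N program` for each N and collect the exit codes.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without actually launching MPI jobs.
    """
    results = {}
    for n in counts:
        proc = runner(
            [mpiexec, "-n", str(n), program],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # A nonzero code means the launcher reported a failure for this N.
        results[n] = proc.returncode
    return results
```

For example, `run_with_counts("/home/wri/wrfems/uems/util/mpich2/bin/mpiexec", "/home/wri/wrfems/uems/bin/geogrid", [1, 2, 4, 8, 16, 32])` returns a dictionary mapping each process count to its exit code. A failure even at 1 process would point away from anything specific to running on 32 cores.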

        • 1. Re: Issues with mpich/mpiexec on Dual Intel Xeon E5-2697A v4 System
          Intel Corporation
          This message was posted on behalf of Intel Corporation

           
          Thank you very much for contacting the Intel® communities. We will do our best to provide the information you are looking for.
           
          Regarding your question about whether the problem with the application is related to the Intel® E5-2697A v4 processor supporting TSX-NI: it is hard to tell for sure, as it depends on the requirements of the application itself. Depending on the model of the board, you might be able to disable TSX-NI in the BIOS settings or through a BIOS update.
           
          Please note that Intel's tests were done using Windows as the operating system. Since you are using Fedora, we recommend visiting the Fedora forums for further technical assistance on this subject:
          https://fedoraforum.org/
           
          If you have any further questions, please let me know.
           
          Regards,
          Alberto R