7 Replies Latest reply on Jan 6, 2012 6:17 AM by drodo

    Question on GORY API :: RCCE_send & recv

    drodo

      Hello all,

       

      I am trying a simple application with the gory RCCE API and I am getting some weird behaviour.

       

      It involves an integer, initialized to 10. First, Core 1 prints this value. In the meantime, Core 0 changes the value to 100 and sends it to Core 1. Finally, Core 1 is supposed to print the new value. The code goes as follows:

       

      /* include etc... */

      int main(int argc, char **argv){

              int number=10;
              RCCE_init(&argc, &argv);
              RCCE_FLAG ready,sent;

              RCCE_flag_alloc(&ready);
              RCCE_flag_alloc(&sent);

              t_vcharp buffer=RCCE_malloc(sizeof(int));

              int ID = RCCE_ue();

              /* Core 0 */
              if (ID==0) {
                      number=100;
                      RCCE_send((char *) &number,buffer,sizeof(int),&ready,&sent,sizeof(int),1);
              }
              /* Core 1 */
              else {
                      printf("The old value of the variable is %d\n",number);
                      RCCE_recv((char *) &number,buffer,sizeof(int),&ready,&sent,sizeof(int),0);
                      printf("The new value of the variable is %d\n",number);
              }

              RCCE_free(buffer);
              RCCE_flag_free(&ready);
              RCCE_flag_free(&sent);
              RCCE_finalize();
              return 0;
      }

       

      However, both printfs display the initial value of the integer (i.e. 10):

       

      pssh -h PSSH_HOST_FILE.19486 -t -1 -p 2 /shared/drodo/mpb.19486 < /dev/null
      [1] 15:28:20 [SUCCESS] rck10
      [2] 15:28:20 [SUCCESS] rck11
      pssh -h PSSH_HOST_FILE.19486 -t -1 -P -p 2 /shared/drodo/test 2 0.533 10 11 < /dev/null
      rck11: The old value of the variable is 10
      The new value of the variable is 10
      [1] 15:28:21 [SUCCESS] rck10
      [2] 15:28:21 [SUCCESS] rck11
      

       

      Am I using RCCE_send/RCCE_recv correctly for the GORY API? I really cannot find the fault in this one!

       

      Thanks for your time.

       

      Regards,

       

      Dimitrios

        • 1. Re: Question on GORY API :: RCCE_send & recv
          aprell

          Hi Dimitrios,

           

          RCCE_malloc() requires a size argument that is a multiple of the cache line size: 32, 64, etc. In your example code, sizeof(int) is not, so RCCE_malloc() returns NULL.
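
          For example, a minimal fix inside your main() would be something like this (just a sketch; the NULL check and error message are only for illustration):

          /* request one full 32-byte MPB cache line instead of sizeof(int) */
          t_vcharp buffer = RCCE_malloc(32);
          if (buffer == NULL) {
                  fprintf(stderr, "UE %d: RCCE_malloc failed\n", RCCE_ue());
                  return 1;
          }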

          • 2. Re: Question on GORY API :: RCCE_send & recv
            drodo

            Works OK with a multiple of 32 as the RCCE_malloc argument!

            Thanks for the help!

            • 3. Re: Question on GORY API :: RCCE_send & recv
              drodo

              I apologize for the consecutive posts, but I see another strange behaviour:

               

              I run the same source code (with the difference that the RCCE_malloc'ed space is now 32 bytes). After a couple of consecutive successful runs, there are cases where the execution freezes and the pssh of the executable doesn't return. This happens especially if I change the pair of cores on which the above application runs.

               

              Any thoughts?

               

              Thanks for your time!

              • 4. Re: Question on GORY API :: RCCE_send & recv
                aprell

                Can you identify the point where the program hangs?

                • 5. Re: Question on GORY API :: RCCE_send & recv
                  drodo

                  Well, the problem is that the debugging printfs/fprintfs don't print anything, and I am unable to run gdb per core. Another odd behaviour: after I kill the hanging processes on the respective cores, recompile the exact same source, and rccerun it again, the processes complete successfully!

                   

                  Since memory allocation (on the MPB or not) is a basic concern, I have a question related to RCCE_send/recv:

                   

                  I read in the non-gory implementation of RCCE_send that the communication buffer it uses (RCCE_buff_ptr) is not actually allocated with RCCE_malloc, but is initialized to point at the second MPB cache line during RCCE_init.

                   

                  When using RCCE_send/recv in gory mode, does the communication buffer have to be RCCE_malloc'ed in advance, or does a pointer to a specific MPB cache line (e.g. the second) suffice?

                   

                  Thanks again for your time!

                  • 6. Re: Question on GORY API :: RCCE_send & recv
                    tedk

                    I don't understand why your printfs do not get printed. With the latest sccKit 1.4.1.3, I don't see that problem. We did have a "heavy I/O" issue with 1.3.0: lots of printfs caused the cores to hang, and "lots" was a relative term; it didn't take very many to cause a problem. What system are you running on? Do you have eMAC enabled? If you cat your systemSettings.ini, you should see an sccMacEnable line. If you are on a data center system, LMK which one and I'll check the configuration. Note that we are now looking at an sccKit 1.4.2.

                     

                    Yes, in non-gory mode RCCE_malloc() does not actually do a malloc(). In gory mode it does do a malloc(), but what it malloc()s is not the data that gets sent; rather, it is a structure that forms the beginning of a circular list, whose elements point to offsets in the portion of the MPB assigned to that core. RCCE_admin.c:RCCE_init() calls RCCE_malloc.c:RCCE_malloc_init(); in non-gory mode, RCCE_malloc_init() just assigns some pointers, while in gory mode it creates the first element of that circular list.

                     

                    I'm assuming you are coding in gory mode. If that's true, you do need to know the pointer into the core's MPB. You don't really have to get it from RCCE_malloc(); you might figure out another way to get it, but RCCE_malloc() is the easiest way, I think. You must also allocate flags with RCCE_flags.c:RCCE_flag_alloc().
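
                    In outline, the gory-mode setup is something like the following (just a sketch; the 32 is the MPB cache line size):

                    RCCE_FLAG ready, sent;
                    RCCE_flag_alloc(&ready);            /* flags come from RCCE_flag_alloc() */
                    RCCE_flag_alloc(&sent);
                    t_vcharp combuf = RCCE_malloc(32);  /* pointer into this core's MPB, one full cache line */
                    /* combuf and the flags are then handed to RCCE_send()/RCCE_recv() */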

                     

                    I attached a PDF of some old (very rough) notes I made about RCCE_malloc(). I read them over and think they are still accurate; I hope they're helpful. If you see something wrong in this file, please let me know.

                    • 7. Re: Question on GORY API :: RCCE_send & recv
                      drodo

                      Overall Observation:

                      Trying to run parallel code with RCCE_send & RCCE_recv in gory mode continued to cause the same behaviour. As a result, I switched to one-sided communication across the participating cores, and all seems to be going well (the implementation is similar to Fig. 4 of [1]).
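
                      For completeness, the pattern I ended up with is roughly the following (a sketch, not my exact code and not literally Fig. 4 of [1]; it pads the payload to one full 32-byte MPB line and assumes the standard gory calls RCCE_put, RCCE_get, RCCE_flag_write and RCCE_wait_until, with a plain char array as the private side of the transfer):

                      #include <stdio.h>
                      #include <string.h>
                      #include "RCCE.h"

                      int main(int argc, char **argv){

                              RCCE_init(&argc, &argv);

                              RCCE_FLAG data_ready;
                              RCCE_flag_alloc(&data_ready);

                              /* one MPB cache line as the shared slot; same offset on every core */
                              t_vcharp slot = RCCE_malloc(32);

                              int number = 10;
                              char payload[32];               /* int padded to one full MPB line */

                              if (RCCE_ue() == 0) {
                                      number = 100;
                                      memcpy(payload, &number, sizeof(int));
                                      /* write the line into UE 1's MPB slot, then raise UE 1's flag */
                                      RCCE_put(slot, (t_vcharp) payload, 32, 1);
                                      RCCE_flag_write(&data_ready, RCCE_FLAG_SET, 1);
                              } else if (RCCE_ue() == 1) {
                                      printf("The old value of the variable is %d\n", number);
                                      /* wait for UE 0's signal, then pull the line out of the local MPB */
                                      RCCE_wait_until(data_ready, RCCE_FLAG_SET);
                                      RCCE_get((t_vcharp) payload, slot, 32, RCCE_ue());
                                      memcpy(&number, payload, sizeof(int));
                                      printf("The new value of the variable is %d\n", number);
                              }

                              RCCE_free(slot);
                              RCCE_flag_free(&data_ready);
                              RCCE_finalize();
                              return 0;
                      }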

                       

                      Regarding the printfs:

                      I was unable to find a good explanation for why they were not working. I noticed that the same executable that hung would run successfully only after recompilation and/or clearing the MPB from the sccGui, with no changes to the source! The sccKit I am using is 1.4.1 and I am running on a local SCC, not on a data centre system. However, the printf behaviour is of reduced priority for my work, since the communication problem was eventually solved. The result of the proposed cat'ing is as follows:

                       

                      $ cat systemSettings.ini | grep sccMacEnable
                      sccMacEnable=a

                       

                      Regarding RCCE_malloc:

                      Both your explanation and the notes are very helpful feedback.

                       

                      Thanks for your time.

                       

                      [1] Mattson, T.G.; Van der Wijngaart, R.F.; Riepen, M.; Lehnig, T.; Brett, P.; Haas, W.; Kennedy, P.; Howard, J.; Vangal, S.; Borkar, N.; Ruhl, G.; Dighe, S., "The 48-core SCC Processor: the Programmer's View," Proceedings of the 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11, 13-19 Nov. 2010.