15 Replies. Latest reply on Sep 13, 2011 5:57 AM by aprell

    Is RCCE_barrier() necessary?

    vmaffione

      Hi all,

        I am writing a parallel application for the SCC platform. My application uses only RCCE_send and RCCE_recv.

      At the beginning of my experiments, after RCCE initialization, I did not call RCCE_barrier() before using sends and receives, and sometimes my program hung for no apparent reason (I'm 100% sure there are no deadlock conditions). After some experiments (some failed!), I decided to call RCCE_barrier() before calling any RCCE_send() or RCCE_recv(), and now it seems to work properly!

      My question is: is it possible that my problem was due to the absence of RCCE_barrier()?

      More precisely, I think the following can happen:

      Imagine that core A is going to send a message to core B (and of course B is going to receive a message from core A).

      If core A calls RCCE_send before RCCE_init() on core B has completed (or maybe vice versa), A might try to read synchronization flags in the MPB of core B that are not yet initialized, and so we don't know what could happen.

       

      Is my analysis correct?

       

      thanks

        Vincenzo

        • 1. Re: Is RCCE_barrier() necessary?
          aprell

          I believe your analysis is exactly correct. Maybe we should add a barrier at the end of RCCE_init.

          • 2. Re: Is RCCE_barrier() necessary?
            tedk

            Andreas, it's a good thing you answered first. I was about to say I did not think it was needed, but now I agree with you. The example RCCE code I looked at usually has a barrier at the beginning, but the comment often said it was there so that the timers would be synchronized.

             

            We need a better procedure for accepting RCCE updates from the community. I think Tim is the gatekeeper for RCCE updates. We need to set up something so that he can easily look at and approve patches if that's what he wants to continue to do.

            • 3. Re: Is RCCE_barrier() necessary?
              aprell

              What is the current procedure? Do you pass patches on to Tim?

              • 4. Re: Is RCCE_barrier() necessary?
                tedk

                Thinking about this some more and talking with Rob ... we don't think RCCE_init() needs a barrier. If you are using rccerun, the MPB should be initialized because rccerun does that before running your application. It's possible you might need the barrier for some other reason that we are not aware of. Can you post example code that shows this problem? Thanks.

                • 5. Re: Is RCCE_barrier() necessary?
                  aprell

                  I think Vincenzo pointed out a potential problem: core A wants to send to core B, but core B has yet to initialize its synchronization flags. As a result, the flag update from A will be lost, and A and B deadlock waiting for each other's updates. So I think the problem is not that a core might try to read an uninitialized flag, but rather that it might try to set it.

                   

                  Vincenzo, can you post the code that showed the problem? Otherwise I'll try to come up with an example.

                  • 6. Re: Is RCCE_barrier() necessary?
                    tedk

                    Yes, let's look at the example. We can stick the invocations in a loop and see if we get a hang. I don't know. One day I think it's necessary, and the next day I don't. The data will show.

                    • 7. Re: Is RCCE_barrier() necessary?
                      vmaffione

                      Hi all,

                        Sorry, but I don't have an example because I didn't run any experiments along those lines.

                      I simply went through the RCCE_init code and noticed that RCCE_init allocates synchronization flags (flag_sent, flag_ready), so the situation I illustrated is possible.

                      However, to demonstrate this we can proceed as Ted suggests, simply executing a time-consuming loop at the beginning of RCCE_init, only on core A.

                       

                      I will do it as soon as I have enough time!

                       

                       

                      Best regards

                        Vincenzo

                      • 8. Re: Is RCCE_barrier() necessary?
                        vmaffione

                        I've taken a look at the mpb.c code. Basically, before rccerun runs our program, all MPB locations are set to 0 (if I'm not mistaken).

                         

                        Let's now look at the control flow of RCCE_send:

                         

                             1) RCCE_put(combuf, (t_vcharp) bufptr, nbytes, RCCE_IAM);
                             2) RCCE_flag_write(sent, RCCE_FLAG_SET, dest);   // RCCE_FLAG_SET == 1
                             // wait for the destination to be ready to receive a message
                             3) RCCE_wait_until(*ready, RCCE_FLAG_SET);
                             4) RCCE_flag_write(ready, RCCE_FLAG_UNSET, RCCE_IAM);

                         

                        and, correspondingly, at what RCCE_recv does:

                         

                            1) RCCE_wait_until(*sent, RCCE_FLAG_SET);
                            2) RCCE_flag_write(sent, RCCE_FLAG_UNSET, RCCE_IAM);
                            // copy data from local MPB space to private memory
                            3) RCCE_get((t_vcharp)bufptr, combuf, nbytes, source);

                            // tell the source I have moved data out of its comm buffer
                            4) RCCE_flag_write(ready, RCCE_FLAG_SET, source);

                         

                        If core A (which is executing RCCE_send) executes (2) before core B (which will execute RCCE_recv) initializes its "RCCE_flag_sent" array, core B will miss this synchronization step, provided RCCE_flag_alloc sets the locations of that array to 0. (Honestly, I could not tell whether RCCE_flag_alloc zeroes the flag it allocates. I'm using the latest RCCE trunk version, and I imagine the answer is "yes".) In this case core A will block forever on (3) and core B will block forever on (1)  ==> deadlock.

                         

                        To test this, we could insert

                         

                          if ( I am B )

                            sleep( 3 );  // 3 seconds (or milliseconds?? I don't remember!)

                         

                        at the beginning of RCCE_init.
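For anyone without access to the hardware, the delay experiment can be approximated off-chip. The following is a minimal sketch in plain C, not RCCE code: it replays the send/recv steps quoted above in lock-step ticks. The names flags_t and simulate, and the parameters b_init_delay, init_zeroes and use_barrier, are hypothetical and exist only for this illustration. Whether initialization really zeroes the flags is exactly the open question, so it is modelled as a parameter rather than assumed as a fact.

```c
#include <assert.h>
#include <stdbool.h>

enum { FLAG_UNSET = 0, FLAG_SET = 1 };

/* Hypothetical model of the two flags involved in one A -> B message. */
typedef struct {
    int sent;   /* lives in B's MPB, written remotely by A (send step 2) */
    int ready;  /* lives in A's MPB, written remotely by B (recv step 4) */
} flags_t;

/*
 * Replays one A -> B message in lock-step ticks.
 *   b_init_delay : ticks B spends before its init runs (the sleep() idea)
 *   init_zeroes  : ASSUMPTION under test: does B's init reset its flags to 0?
 *   use_barrier  : if true, A may not start sending until B has initialized
 *                  (A's own init is treated as instantaneous in this model)
 * Returns true if both sides finish, false if the tick budget runs out
 * with at least one side still waiting, i.e. the modelled deadlock.
 */
bool simulate(int b_init_delay, bool init_zeroes, bool use_barrier)
{
    assert(b_init_delay >= 0);

    flags_t f = { FLAG_UNSET, FLAG_UNSET };  /* MPB starts zeroed */
    bool b_initialized = false;
    int a_pc = 0;  /* 0: before send step 2, 1: waiting at step 3, 2: done */
    int b_pc = 0;  /* 0: waiting at recv step 1, 1: steps 2-4 pending, 2: done */

    int budget = b_init_delay + 16;
    for (int tick = 0; tick < budget; tick++) {
        /* Core B */
        if (!b_initialized) {
            if (tick >= b_init_delay) {
                if (init_zeroes)
                    f.sent = FLAG_UNSET;     /* may wipe A's earlier update */
                b_initialized = true;
            }
        } else if (b_pc == 0) {
            if (f.sent == FLAG_SET)          /* recv step 1 */
                b_pc = 1;
        } else if (b_pc == 1) {
            f.sent = FLAG_UNSET;             /* recv step 2 */
            f.ready = FLAG_SET;              /* recv step 4 (step 3 copies data) */
            b_pc = 2;
        }

        /* Core A */
        if (a_pc == 0) {
            if (!use_barrier || b_initialized) {
                f.sent = FLAG_SET;           /* send step 2 (step 1 copies data) */
                a_pc = 1;
            }
        } else if (a_pc == 1) {
            if (f.ready == FLAG_SET) {       /* send step 3 */
                f.ready = FLAG_UNSET;        /* send step 4 */
                a_pc = 2;
            }
        }

        if (a_pc == 2 && b_pc == 2)
            return true;
    }
    return false;
}
```

Under these assumptions, simulate(3, true, false) ends in the modelled deadlock, while simulate(3, true, true) and simulate(3, false, false) both complete. In other words, the hang needs both a late init on B and an init that resets the flag; a barrier after init or a non-zeroing init each removes it in this model.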

                         

                         

                        I hope I didn't talk nonsense.

                          Vincenzo

                        • 9. Re: Is RCCE_barrier() necessary?
                          aprell

                          That's exactly the problem I was thinking about. I can't test it before next week though...

                          • 10. Re: Is RCCE_barrier() necessary?
                            vmaffione

                            I know!

                            I only wanted to make the question clear (especially to myself!).

                             

                            As I said, I don't understand what flag_alloc does (I should investigate further), but if the flag is initialized to 0 (can you confirm this?), the problem does exist.

                             

                            Thank you,

                              Vincenzo

                            • 11. Re: Is RCCE_barrier() necessary?
                              aprell

                              I've run a few tests and it seems to work, but only because flags are not initialized, so there's no value that gets overwritten. I could reproduce the deadlock described above after making sure that flags are initialized with a default value. So yes, it looks like the barrier is not strictly needed...
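To make the difference between the two allocation behaviours concrete, here is a small stand-alone model in plain C. It is only a sketch and does not reproduce the real RCCE_flag_alloc: mpb_t, mpb_clear, flag_alloc and early_write_survives are hypothetical names, MPB_BYTES is an arbitrary size, and the sketch assumes the sender already knows where the receiver's flag will end up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MPB_BYTES 32  /* hypothetical size, for illustration only */

/* Hypothetical stand-in for one core's message-passing buffer. */
typedef struct {
    unsigned char mem[MPB_BYTES];
    int next_free;  /* cursor of a simple bump allocator */
} mpb_t;

/* Models the MPB being cleared before the program starts. */
void mpb_clear(mpb_t *mpb)
{
    memset(mpb->mem, 0, sizeof mpb->mem);
    mpb->next_free = 0;
}

/* Hands out the next flag byte; optionally resets it on allocation. */
unsigned char *flag_alloc(mpb_t *mpb, bool zero_on_alloc)
{
    assert(mpb->next_free < MPB_BYTES);
    unsigned char *flag = &mpb->mem[mpb->next_free++];
    if (zero_on_alloc)
        *flag = 0;  /* this is the write that would lose an early update */
    return flag;
}

/* Returns true if a remote write made BEFORE allocation is still visible after it. */
bool early_write_survives(bool zero_on_alloc)
{
    mpb_t mpb;
    mpb_clear(&mpb);
    mpb.mem[0] = 1;  /* sender sets the flag location before the owner allocates it */
    unsigned char *sent = flag_alloc(&mpb, zero_on_alloc);
    return *sent == 1;
}
```

In this model early_write_survives(false) holds and early_write_survives(true) does not, which mirrors the observation above: the early update is only lost when allocation writes a default value into the flag.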

                              • 12. Re: Is RCCE_barrier() necessary?
                                drodo

                                Has this issue been fixed in any later versions of the RCCE library or should we still intervene manually as indicated above?

                                 

                                Also, is it possible to get the number of a UE (which is needed for the sleep-related if clause above) without having completed the call to RCCE_init? I thought that RCCE_init had to precede any other RCCE call (including RCCE_ue, which is needed here).

                                 

                                Thanks,

                                 

                                Dimitris

                                • 13. Re: Is RCCE_barrier() necessary?
                                  aprell

                                  Dimitrios,

                                   

                                  Last time I checked I couldn't find a bug. Flag locations are not overwritten when they are allocated, so the deadlock from above cannot happen.

                                   

                                  Are you having problems if you don't include a barrier after RCCE_init?

                                   

                                  RCCE_ue returns the value of the global variable RCCE_IAM, which is assigned in RCCE_init. You can look at the corresponding code in RCCE_admin.c and move it out of RCCE_init, if you think that's a good idea.

                                   

                                  There's also a function MYCOREID, which returns a core's physical ID, in case you don't want to use the ranks assigned by RCCE.

                                  • 14. Re: Is RCCE_barrier() necessary?
                                    drodo

                                    Andreas,

                                     

                                    I have no problem with RCCE_init. I just read through the above posts and thought the deadlock was a legitimate concern. It seems it is not, though, so false alarm.

                                     

                                    Thanks for the info on RCCE_ue and MYCOREID. I will use the latter for out-of-RCCE purposes.
