I believe your analysis is exactly correct. Maybe we should add a barrier at the end of RCCE_init.
Andreas, it's a good thing you answered first. I was about to say I did not think it was needed, but now I agree with you. The example RCCE code I looked at usually has a barrier at the beginning, but the accompanying comment often said it was there so that the timers would be synchronized.
We need a better procedure for accepting RCCE updates from the community. I think Tim is the gatekeeper for RCCE updates. We need to set up something so that he can easily look at and approve patches if that's what he wants to continue to do.
What is the current procedure? Do you pass patches on to Tim?
Thinking about this some more and talking with Rob ... we don't think RCCE_init() needs a barrier. If you are using rccerun, the MPB should be initialized because rccerun does that before running your application. It's possible you might need the barrier for some other reason that we are not aware of. Can you post an example code that shows this problem? Thanks.
I think Vincenzo pointed out a potential problem: core A wants to send to core B, but core B has yet to initialize its synchronization flags. As a result, the flag update from A will be lost, and A and B deadlock waiting for each other's updates. So I think the problem is not that a core might try to read an uninitialized flag, but rather that it might try to set it.
Vincenzo, can you post the code that showed the problem? Otherwise I'll try to come up with an example.
Yes, let's look at the example. We can stick the invocations in a loop and see if we get a hang. I don't know. One day I think it's necessary, and the next day I don't. The data will show.
Sorry, but I don't have an example because I haven't run any experiments on this.
I simply went through the RCCE_init code and noticed that RCCE_init allocates synchronization flags (flag_sent, flag_ready), so the situation I described is possible.
However, to demonstrate this we can proceed as Ted suggests and simply execute a time-consuming loop at the beginning of RCCE_init, on Core A only.
I will do it as soon as I have enough time!
I've taken a look at the mpb.c code. Basically, before rccerun runs our program, all MPB locations are set to 0 (if I'm not mistaken).
Now let's look at the control flow of RCCE_send:
1) RCCE_put(combuf, (t_vcharp) bufptr, nbytes, RCCE_IAM);
2) RCCE_flag_write(sent, RCCE_FLAG_SET, dest); // RCCE_FLAG_SET == 1
// wait for the destination to be ready to receive a message
3) RCCE_wait_until(*ready, RCCE_FLAG_SET);
4) RCCE_flag_write(ready, RCCE_FLAG_UNSET, RCCE_IAM);
and correspondingly what RCCE_recv does:
1) RCCE_wait_until(*sent, RCCE_FLAG_SET);
2) RCCE_flag_write(sent, RCCE_FLAG_UNSET, RCCE_IAM);
// copy data from local MPB space to private memory
3) RCCE_get((t_vcharp)bufptr, combuf, nbytes, source);
// tell the source I have moved data out of its comm buffer
4) RCCE_flag_write(ready, RCCE_FLAG_SET, source);
If core A (which is executing RCCE_send) executes (2) before core B (which will execute RCCE_recv) initializes its "RCCE_flag_sent" array, core B will miss this synchronization step, provided RCCE_flag_alloc sets the locations of that array to 0. (Honestly, I could not tell whether RCCE_flag_alloc zeroes the flag it allocates; I'm using the latest RCCE trunk version, but I imagine the answer is "yes".) In that case core A will block forever on (3) and core B will block forever on (1) ==> deadlock.
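To make the interleaving above easier to reason about, here is a small toy model in plain C. It does not use RCCE at all: toy_flags, toy_flag_alloc, toy_init and toy_handshake_completes are hypothetical names made up for this sketch, and the two integers only stand in for the "sent" and "ready" flags in the MPB. Whether flag allocation really clears the flag is exactly the open question, so the model takes it as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two synchronization flags of one core.
 * This is NOT the RCCE data layout, only a model of it. */
typedef struct {
    int sent;   /* written remotely by the sender (send step 2)   */
    int ready;  /* written remotely by the receiver (recv step 4) */
} toy_flags;

/* Hypothetical flag allocation: clears the flag only if alloc_clears
 * is true. What the real RCCE_flag_alloc does is an assumption here. */
static void toy_flag_alloc(int *flag, bool alloc_clears)
{
    assert(flag != NULL);
    if (alloc_clears)
        *flag = 0;
}

/* Hypothetical per-core init: allocates both flags. */
static void toy_init(toy_flags *f, bool alloc_clears)
{
    toy_flag_alloc(&f->sent, alloc_clears);
    toy_flag_alloc(&f->ready, alloc_clears);
}

/* Replays one send (core A) / recv (core B) pair in a fixed order.
 * Returns true if the handshake completes, false if both cores
 * would spin forever (A on send step 3, B on recv step 1). */
bool toy_handshake_completes(bool b_inits_late, bool alloc_clears)
{
    /* Assumption from the thread: all MPB locations start at 0. */
    toy_flags a = {0, 0};
    toy_flags b = {0, 0};

    toy_init(&a, alloc_clears);
    if (!b_inits_late)
        toy_init(&b, alloc_clears);

    b.sent = 1;                      /* send step 2: A sets B's flag */

    if (b_inits_late)
        toy_init(&b, alloc_clears);  /* B's init runs after A's write */

    if (b.sent != 1)                 /* recv step 1: B waits for sent */
        return false;                /* update lost: B and A both block */
    b.sent = 0;                      /* recv step 2 */
    a.ready = 1;                     /* recv step 4: B sets A's flag */

    if (a.ready != 1)                /* send step 3: A waits for ready */
        return false;
    a.ready = 0;                     /* send step 4 */
    return true;
}
```

In this model the handshake only fails when B initializes late and allocation clears the flag, i.e. toy_handshake_completes(true, true). If allocation leaves the location untouched, the late init is harmless.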
To run an experiment we could insert
if ( I am B )
sleep( 3 ); // 3 seconds (POSIX sleep() takes seconds, not milliseconds)
at the beginning of RCCE_init.
I hope I didn't talk nonsense.
That's exactly the problem I was thinking about. I can't test it before next week though...
I only wanted to make the question clear (especially to me!).
As I said, I don't understand what flag_alloc does (I should do more investigations), but if the flag is initialized to 0 (can you confirm this??), the problem
I've run a few tests and it seems to work, but only because flags are not initialized, so there's no value that gets overwritten. I could reproduce the deadlock described above after making sure that flags are initialized with a default value. So yes, it looks like the barrier is not strictly needed...
Has this issue been fixed in any later versions of the RCCE library or should we still intervene manually as indicated above?
Also, is it possible to get the number of a UE (which is needed for the sleep-related if clause above) without having completed the call to RCCE_init? I thought that RCCE_init had to precede any other RCCE call, including the RCCE_ue call that is needed here.
Last time I checked I couldn't find a bug. Flag locations are not overwritten when they are allocated, so the deadlock from above cannot happen.
Are you having problems if you don't include a barrier after RCCE_init?
RCCE_ue returns the value of the global variable RCCE_IAM, which is assigned in RCCE_init. You can look at the corresponding code in RCCE_admin.c and move it out of RCCE_init, if you think that's a good idea.
There's also a function MYCOREID, which returns a core's physical ID, in case you don't want to use the ranks assigned by RCCE.
I have no problem with RCCE_init. I just read through the above posts and thought that the deadlock was a legitimate concern. It seems it is not, though, so false alarm.
Thanks for the info on RCCE_ue and MYCOREID. I will use the latter for out-of-RCCE purposes.