I experienced some issues sending messages from one core to another using RCCE_get() and RCCE_put(). Each message sent is followed by a flag signaling that a message was sent. When a core has received and processed a message (where processing means checking that all values are the expected ones, then resetting them to another "marker" value), then and only then does it send an acknowledgement to the sender core. The process is asynchronous: the sender core does not wait for an acknowledgement to finish its send operation, but checks for it later, before sending any other message; the receiver follows a similar procedure. All N cores (from 2 to 48) play this scenario, where core n sends to core n - 1. A cyclic variant allows core 0 to send to core N - 1.
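To make the handshake above concrete, here is a minimal single-process model of it. It is only a sketch and not the code from the attachment: channel_t, try_send() and try_receive() are hypothetical names, plain memory writes stand in for RCCE_put() and the flag writes, and MSG_WORDS is an arbitrary size chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_WORDS 8           /* arbitrary message length, for illustration */
#define PAYLOAD   1u          /* every word of a valid message */
#define MARKER    0xdeadbeefu /* value the receiver resets the buffer to */

/* Hypothetical stand-in for the receiver's buffer and the two flags. */
typedef struct {
    uint32_t msg[MSG_WORDS];
    bool msg_flag; /* raised by the sender after writing msg */
    bool ack_flag; /* raised by the receiver after processing msg */
} channel_t;

void channel_init(channel_t *ch)
{
    assert(ch != NULL);
    for (size_t i = 0; i < MSG_WORDS; i++)
        ch->msg[i] = MARKER;
    ch->msg_flag = false;
    ch->ack_flag = false;
}

/* Sender: checks the pending acknowledgement only now, before the next
 * send. Returns false if the previous message is not acknowledged yet. */
bool try_send(channel_t *ch, bool *awaiting_ack)
{
    assert(ch != NULL && awaiting_ack != NULL);
    if (*awaiting_ack) {
        if (!ch->ack_flag)
            return false;
        ch->ack_flag = false;
        *awaiting_ack = false;
    }
    for (size_t i = 0; i < MSG_WORDS; i++)
        ch->msg[i] = PAYLOAD;  /* stands in for RCCE_put() of the message */
    ch->msg_flag = true;       /* the flag is written after the message */
    *awaiting_ack = true;
    return true;
}

/* Receiver: returns -1 if no message is flagged, otherwise the number of
 * unexpected words. Resets the buffer to MARKER, then acknowledges. */
int try_receive(channel_t *ch)
{
    assert(ch != NULL);
    if (!ch->msg_flag)
        return -1;
    ch->msg_flag = false;
    int bad = 0;
    for (size_t i = 0; i < MSG_WORDS; i++) {
        if (ch->msg[i] != PAYLOAD)
            bad++;
        ch->msg[i] = MARKER;
    }
    ch->ack_flag = true;
    return bad;
}
```

In this model the writes are seen in program order, so try_receive() never reports a bad word; the question in this post is why the same cannot be assumed on the real chip.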
I implemented this scenario on Marc038 using RCCE_put() and RCCE_get() for messages, and either RCCE_put() and RCCE_get() or RCCE_flag_write() and RCCE_flag_read() for flags (depending on the implementation variant). Messages are sequences of the value 1, each coded as a 4-byte int, and markers are the hexadecimal value 0xdeadbeef. The simple_skip_msg source files are attached to this post; never mind skip_msg, as I think that implementation is buggy. If I run it with all cores, I almost always end up with an error after a few dozen or a few hundred thousand messages sent: some core finds 0xdeadbeef instead of 1111111111 in the message; most of the time it is a sequence of 32*n 0xdeadbeef values, but sometimes, exceptionally, fewer than 32. The 0xdeadbeef values can be at the beginning of the message, at its end or in the middle. This shows that the flag arrived before the message, even though its corresponding RCCE_put()/flag_write() was issued after the RCCE_put() for the actual message. Note that the marc038 Rocky Creek chip was replaced recently and all cores now run the mpbtest perfectly (the one from Pablo Rebble, as discussed in bug report 487 http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=487). After I reported the bug related to the scenario I describe here (http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=495), my chip was replaced once again, with no better success.
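Since the position and length of the 0xdeadbeef runs seem to matter here, a small helper along the following lines could report them for a faulty buffer. This is a sketch under my own naming (marker_run_t and first_marker_run() are hypothetical and not part of the attached sources); it assumes the buffer has already been copied into private memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER 0xdeadbeefu /* value left in words that have not arrived */

/* Position and length, in 4-byte words, of a run of marker values. */
typedef struct {
    size_t start;
    size_t length;
} marker_run_t;

/* Finds the first contiguous run of MARKER words in buf.
 * Returns false (and leaves *run untouched) if buf holds no marker. */
bool first_marker_run(const uint32_t *buf, size_t words, marker_run_t *run)
{
    assert(buf != NULL && run != NULL);
    size_t i = 0;
    while (i < words && buf[i] != MARKER)
        i++;
    if (i == words)
        return false;
    run->start = i;
    while (i < words && buf[i] == MARKER)
        i++;
    run->length = i - run->start;
    return true;
}
```

Logging run.start and run.length for each faulty buffer would make it easy to see whether the runs line up with some fixed granularity.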
You can compile the source files with the usual icc compilation tools. Use the command line below, where $PELIB_HOME points to the root directory of the RCCE sources, and where libRCCE_bigflags_gory_nopwrmgmt.a as well as mpb are compiled in bin/SCC_LINUX.
make MPB_SIZE=8128 PELIB_HOME=$PELIB_HOME
Run the test case using the command below:
rccerun -nue 48 -f hostfile -clock 0.533 simple_skip_msg
Here, hostfile lists all cores from 00 to 47, in natural order. If you don't do anything, the program runs indefinitely and outputs faulty buffers when it finds some (as well as a little more information). To stop all instances in an easy way, use the command below:
for i in `seq 0 47`; do ( ( ssh root@rck`printf %02d $i` killall -s INT simple_skip_msg; echo Core $i done. ) & ) 2>/dev/null; done; wait 2>/dev/null
It seems to me that memory consistency is not the one expected (I expect that what you write first is what reaches the destination first, as the documentation says about the on-chip network). I couldn't find any such discussion on communities.intel.com, or any more detail about it in the documentation. Have you experienced the same issue? Did I misunderstand anything about the SCC and RCCE? Can you see any flaw in my test?
As a follow-up, I upgraded my test so that each time it detects a wrong message (containing some instances of 0xdeadbeef), it waits one second and checks the buffer again; it then always reads the expected value. This shows that the message had not yet arrived when the flag reached the receiving core, and argues for a weak memory consistency. It also means there is a workaround: check the whole received message for any instance of 0xdeadbeef again and again until no marker is detected anymore, process the message, then reset it to 0xdeadbeef in order to take the next transmission. This has the cost of reading the whole buffer at least once and writing it once, before and after doing anything. Does anyone have an alternative solution?
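The workaround described above could look roughly like the sketch below. The names are mine and hypothetical: refresh_fn stands in for whatever re-reads the message into the private buffer (in the real test that would presumably be another RCCE_get()), and late_link_t only simulates a message that arrives one word at a time. It also assumes a valid message never contains 0xdeadbeef itself, which holds for my test messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAYLOAD 1u          /* every word of a valid message */
#define MARKER  0xdeadbeefu /* value left in words that have not arrived */

/* Hypothetical hook that re-reads the message into buf. */
typedef void (*refresh_fn)(uint32_t *buf, size_t words, void *ctx);

/* Counts the words that still hold the marker value. */
size_t count_markers(const uint32_t *buf, size_t words)
{
    assert(buf != NULL);
    size_t n = 0;
    for (size_t i = 0; i < words; i++)
        if (buf[i] == MARKER)
            n++;
    return n;
}

/* Re-reads and re-scans the message until no marker is left, or until
 * max_polls attempts were made. Returns true once the message is whole. */
bool wait_message_complete(uint32_t *buf, size_t words, refresh_fn refresh,
                           void *ctx, unsigned max_polls, unsigned *polls_used)
{
    assert(buf != NULL && refresh != NULL);
    for (unsigned poll = 1; poll <= max_polls; poll++) {
        refresh(buf, words, ctx);
        if (count_markers(buf, words) == 0) {
            if (polls_used != NULL)
                *polls_used = poll;
            return true;
        }
    }
    if (polls_used != NULL)
        *polls_used = max_polls;
    return false;
}

/* Simulation only: a link that delivers one more word per refresh. */
typedef struct {
    size_t delivered;
} late_link_t;

void late_link_refresh(uint32_t *buf, size_t words, void *ctx)
{
    late_link_t *link = ctx;
    if (link->delivered < words) {
        buf[link->delivered] = PAYLOAD;
        link->delivered++;
    }
}
```

The max_polls bound is there so that a message that is genuinely corrupted, rather than merely late, is eventually reported instead of spinning forever.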
skip_msg.tar.gz 5.9 K