I'm preloading the LUT entries with the default mapping, where the MPBs are located at addresses 0xC0000000 through 0xD7FFFFFF. I then split each tile's MPB into equal parts, so Core 0 starts at address 0 of the on-tile buffer and Core 1 at address 0x1000. My program assumes this default mapping and split and computes the destination's system address accordingly.
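A minimal sketch of the address calculation described above. It assumes the 24 tiles share the 0xC0000000 to 0xD7FFFFFF window equally and in tile-index order (which gives 0x01000000 per tile), and it uses the 0x1000 offset for the second core as stated in the post. The helper name `mpb_system_address` and the constants are hypothetical, not taken from any SCC header.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the post: default MPB window and per-core split. */
#define MPB_BASE        0xC0000000u  /* start of the default MPB window */
#define MPB_CORE_OFFSET 0x1000u      /* second core's part starts here */

/* Assumption: 24 tiles share the window equally, in tile-index order. */
#define NUM_TILES       24u
#define MPB_TILE_STRIDE 0x01000000u  /* (0xD7FFFFFF + 1 - 0xC0000000) / 24 */

/* Hypothetical helper: system address of the start of a core's MPB part. */
static uint32_t mpb_system_address(unsigned core_id)
{
    assert(core_id < 2u * NUM_TILES);
    unsigned tile = core_id / 2u;  /* tile index = x + 6 * y */
    unsigned z    = core_id % 2u;  /* 0 = first core on tile, 1 = second */
    return MPB_BASE + tile * MPB_TILE_STRIDE + z * MPB_CORE_OFFSET;
}
```

Under these assumptions, Core 0 maps to 0xC0000000 and Core 43 (tile 21, second core) maps to 0xD5001000.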
What it boils down to: Messaging through the cache line at address 0x1000 seems to be slower.
Are you using RCCE? Are you using the non-gory (send/recv) or the gory (put/get) versions? If you are using gory, are you using push (put goes to receiver's MPB) or pull (put goes to sender's MPB)? RCCE non-gory send uses pull.
Just to verify ... by putting it in different words ... what you are seeing is an increase in round-trip time when core 0 exchanges messages with odd-numbered cores. At first glance I don't see a reason for this ... but I'd like to understand more about how you are numbering the cores and how you are doing the message-passing.
No, I'm using a very simple self-programmed method to send messages.
The cores are numbered as follows:
Core 0 is the first core on tile (x=0,y=0). Core 1 is the second one on this tile. Core 2 and 3 are the two cores on the tile (x=1,y=0) and so on. Core 46 and 47 are then the cores on tile (x=5,y=3).
For the messaging itself, the "other" core (not 0) writes a 32-byte message into the corresponding message buffer, core 0 receives it and sends a message back. As a concrete example:
- Core 43 writes a cache line to address 0x0 of MPB (x=0,y=0)
- Core 0 receives the message (it polls for a message) and writes another cache line to address 0x1000 of MPB (x=3,y=3)
- Core 43 reads that message
The time for these two messages is measured by Core 43. For each core, I ran 1000000 such tests and took the average round-trip time.
But I just found a programming bug which seems to have caused the difference. I'll report back if the problem persists.
That's how we number cores as well:
coreID = (x + 6 * y) * 2 + z, where z is 0 or 1
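The numbering formula above and its inverse can be sketched as follows. The 6 x 4 tile grid follows from the formula and from tile (x=5,y=3) holding the last two cores. The type and function names are hypothetical.

```c
#include <assert.h>

#define TILES_X 6  /* tiles per row, from the formula above */
#define TILES_Y 4  /* rows; tile (x=5,y=3) holds cores 46 and 47 */

typedef struct { int x, y, z; } core_pos;

/* coreID = (x + 6 * y) * 2 + z */
static int core_id_from_pos(int x, int y, int z)
{
    assert(x >= 0 && x < TILES_X && y >= 0 && y < TILES_Y);
    assert(z == 0 || z == 1);
    return (x + TILES_X * y) * 2 + z;
}

/* Inverse mapping: recover tile coordinates and on-tile index. */
static core_pos core_pos_from_id(int core_id)
{
    assert(core_id >= 0 && core_id < TILES_X * TILES_Y * 2);
    int tile = core_id / 2;
    core_pos p = { tile % TILES_X, tile / TILES_X, core_id % 2 };
    return p;
}
```

For example, Core 43 decodes to x=3, y=3, z=1, which matches the MPB (x=3,y=3) and the 0x1000 offset used in the concrete example.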
Your concrete example uses the RCCE "push" model. RCCE started with "push" but then went to "pull", in which the sender writes into its own MPB and the receiver takes the message from the sender's MPB. RCCE went to "pull" because of the difficulty of implementing RCCE_recv_test() with "push". I don't have details about why RCCE_recv_test() was difficult with "push".
Note that the RCCE gory shift example uses "push", but if you look inside RCCE_send(), you'll see "pull".
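To make the push/pull distinction concrete, here is a toy host-side model in plain C. It is not RCCE code and does not touch real MPB hardware: the `mpb` array stands in for one cache line per core, all function names are hypothetical, and the flag handling or polling that a real exchange needs is left out.

```c
#include <assert.h>
#include <string.h>

#define NUM_CORES 48
#define LINE_SIZE 32  /* one 32-byte message, as in the example above */

/* Toy stand-in for the per-core MPB parts: one cache line per core. */
static unsigned char mpb[NUM_CORES][LINE_SIZE];

/* Push: the sender writes into the receiver's MPB ... */
static void push_put(int sender, int receiver, const void *msg)
{
    (void)sender;
    assert(receiver >= 0 && receiver < NUM_CORES);
    memcpy(mpb[receiver], msg, LINE_SIZE);
}

/* ... and the receiver reads from its own MPB. */
static void push_get(int sender, int receiver, void *out)
{
    (void)sender;
    assert(receiver >= 0 && receiver < NUM_CORES);
    memcpy(out, mpb[receiver], LINE_SIZE);
}

/* Pull: the sender writes into its own MPB ... */
static void pull_put(int sender, int receiver, const void *msg)
{
    (void)receiver;
    assert(sender >= 0 && sender < NUM_CORES);
    memcpy(mpb[sender], msg, LINE_SIZE);
}

/* ... and the receiver fetches the message from the sender's MPB. */
static void pull_get(int sender, int receiver, void *out)
{
    (void)receiver;
    assert(sender >= 0 && sender < NUM_CORES);
    memcpy(out, mpb[sender], LINE_SIZE);
}
```

In this model, a message from Core 43 to Core 0 lands in `mpb[0]` with push and in `mpb[43]` with pull; only the location of the remote access changes (a remote write for push, a remote read for pull).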