What are you using to resolve the core number to system address? RCCE?
I'm preloading the LUT entries with the default mapping, where the MPBs are located at addresses 0xC0000000 through 0xD7FFFFFF. I then split each tile's MPB into two equal parts, so Core 0 starts at address 0 of the on-tile buffer and Core 1 at address 0x1000. My program assumes this default mapping and split, and determines the destination's system address accordingly.
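As a rough sketch of the address calculation described above: the base address and the 0x1000 split come from the post, and the per-tile stride follows from dividing the stated range evenly over 24 tiles. The helper name `mpb_system_address` and the tile index order (tile = x + 6 * y) are assumptions made for illustration, not the poster's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* From the post: the default LUT mapping places the MPBs at
   0xC0000000..0xD7FFFFFF. Divided evenly over 24 tiles, that is
   16 MB of system address space per tile. */
#define MPB_BASE        0xC0000000u
#define MPB_TILE_STRIDE 0x01000000u  /* (0xD8000000 - 0xC0000000) / 24 */
#define MPB_CORE_OFFSET 0x1000u      /* start of the second core's part, per the post */
#define NUM_TILES       24u

/* Hypothetical helper: system address of the start of a core's MPB part.
   tile is assumed to be x + 6 * y; core_on_tile is 0 or 1. */
static uint32_t mpb_system_address(unsigned tile, unsigned core_on_tile)
{
    assert(tile < NUM_TILES && core_on_tile < 2);
    return MPB_BASE + tile * MPB_TILE_STRIDE + core_on_tile * MPB_CORE_OFFSET;
}
```

Under these assumptions, the second core on tile (x=3,y=3), i.e. tile 21, would map to 0xD5001000.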
What it boils down to: Messaging through the cache line at address 0x1000 seems to be slower.
Are you using RCCE? Are you using the non-gory (send/recv) or the gory (put/get) versions? If you are using gory, are you using push (put goes to the receiver's MPB) or pull (put goes to the sender's MPB)? RCCE non-gory send uses pull.
Just to verify ... by putting it in different words ... what you are seeing is an increase in round-trip time when core 0 exchanges messages with odd-numbered cores. At first glance I don't see a reason for this ... but I'd like to understand more about how you are numbering the cores and how you are doing the message passing.
No, I'm using a very simple self-programmed method to send messages.
The cores are numbered as follows:
Core 0 is the first core on tile (x=0,y=0). Core 1 is the second one on this tile. Core 2 and 3 are the two cores on the tile (x=1,y=0) and so on. Core 46 and 47 are then the cores on tile (x=5,y=3).
For the messaging itself, the "other" core (not 0) writes a 32-byte message into the corresponding message buffer; core 0 receives it and sends a message back. As a concrete example:
- Core 43 writes a cache line to address 0x0 of MPB (x=0,y=0)
- Core 0 receives the message (it polls for a message) and writes another cache line to address 0x1000 of MPB (x=3,y=3)
- Core 43 reads that message
The time for these two messages is measured by Core 43. For each core, I ran 1000000 such tests and took the average round-trip time.
But I just found a programming bug which seems to have caused the difference. I'll report back if the problem persists.
That's how we number cores as well:
coreID = (x + 6 * y) * 2 + z, where z is 0 or 1
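The numbering formula above can be sketched in both directions. This is a minimal illustration assuming the 6 x 4 tile grid implied by the thread (tiles up to x=5, y=3); the struct and function names are hypothetical.

```c
#include <assert.h>

#define TILES_X 6  /* tiles per row, from the formula */
#define TILES_Y 4  /* rows, since the last tile in the thread is (x=5,y=3) */

/* Hypothetical container for a core's position: tile (x, y), core z on tile. */
struct core_pos { int x, y, z; };

/* coreID = (x + 6 * y) * 2 + z */
static int core_id_from_pos(int x, int y, int z)
{
    assert(x >= 0 && x < TILES_X && y >= 0 && y < TILES_Y && (z == 0 || z == 1));
    return (x + TILES_X * y) * 2 + z;
}

/* Inverse mapping: recover (x, y, z) from a core ID in 0..47. */
static struct core_pos core_pos_from_id(int id)
{
    struct core_pos p;
    int tile;
    assert(id >= 0 && id < TILES_X * TILES_Y * 2);
    tile = id / 2;        /* two cores per tile */
    p.z = id % 2;         /* which core on the tile */
    p.x = tile % TILES_X; /* column */
    p.y = tile / TILES_X; /* row */
    return p;
}
```

For example, core 43 decodes to tile (x=3,y=3), second core (z=1), which matches the concrete example earlier in the thread.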
Your concrete example uses the RCCE "push" model. RCCE started with "push" but then went to "pull", in which the sender writes into its own MPB and the receiver takes the message from the sender's MPB. RCCE went to "pull" because of the difficulty of implementing RCCE_recv_test() with "push". I don't have details about why RCCE_recv_test() was difficult with "push".
Note that the RCCE gory shift example uses "push", but if you look inside RCCE_send(), you'll see "pull".
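To make the push/pull distinction concrete, here is a small host-side simulation. It uses ordinary arrays in place of the MPBs and leaves out flags and synchronization entirely, so it is not RCCE code and does not run on SCC hardware as such; all names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NUM_CORES 48
#define LINE_SIZE 32  /* one 32-byte cache line, as in the thread */

/* Simulated MPB part per core, held in ordinary memory. */
static unsigned char mpb[NUM_CORES][LINE_SIZE];

/* Push: the sender writes into the RECEIVER's MPB ... */
static void push_send(int receiver, const void *msg)
{
    assert(receiver >= 0 && receiver < NUM_CORES);
    memcpy(mpb[receiver], msg, LINE_SIZE);
}

/* ... and the receiver reads from its own (local) MPB. */
static void push_recv(int receiver, void *out)
{
    assert(receiver >= 0 && receiver < NUM_CORES);
    memcpy(out, mpb[receiver], LINE_SIZE);
}

/* Pull: the sender writes into its OWN MPB ... */
static void pull_send(int sender, const void *msg)
{
    assert(sender >= 0 && sender < NUM_CORES);
    memcpy(mpb[sender], msg, LINE_SIZE);
}

/* ... and the receiver fetches the data from the sender's (remote) MPB. */
static void pull_recv(int sender, void *out)
{
    assert(sender >= 0 && sender < NUM_CORES);
    memcpy(out, mpb[sender], LINE_SIZE);
}
```

In the push variant the write crosses the mesh and the read is local; in the pull variant it is the other way around.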