I have a very strange problem with RCCE_send/RCCE_recv in non-gory mode. Namely, the performance is terrible for message sizes that are not cache-line aligned. For instance, it takes less than 50k CPU cycles to transfer a 3200B message, but it takes OVER 100k CYCLES to transfer a SMALLER, 3199B message. The difference grows as I increase the message size.
Before I start to dig through the code, is there a bug or am I missing something? I am using the trunk version of RCCE.
Here's the catch. The slowdown is related to the optimized memcpy_get, which prefetches each destination cache line before writing to main memory, so as to avoid the "write-around" behavior of the P54C. It turned out that my test was designed badly: the optimized copy only performs as it should when the destination buffer starts at a properly aligned address, and in my test it did not.
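For anyone hitting the same thing, here is a minimal sketch of how the receive buffer could be allocated so that it starts on a cache-line boundary. It assumes a 32-byte cache line (consistent with the 3200B example above, which is an exact multiple of 32) and a POSIX system that provides posix_memalign. The names CACHE_LINE, is_line_aligned, round_up_to_line and alloc_line_aligned are my own for illustration and are not part of RCCE.

```c
#define _POSIX_C_SOURCE 200112L  /* needed for posix_memalign under strict C modes */
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 32-byte cache line; adjust if your platform differs. */
#define CACHE_LINE 32

/* Returns 1 if ptr starts on a cache-line boundary, 0 otherwise. */
static int is_line_aligned(const void *ptr) {
    return ((uintptr_t)ptr % CACHE_LINE) == 0;
}

/* Rounds size up to the next multiple of the cache line size. */
static size_t round_up_to_line(size_t size) {
    return (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Allocates a buffer that starts on a cache-line boundary and is padded
 * to a whole number of cache lines. Returns NULL on failure.
 * The caller releases the buffer with free(). */
static void *alloc_line_aligned(size_t size) {
    void *buf = NULL;
    if (posix_memalign(&buf, CACHE_LINE, round_up_to_line(size)) != 0) {
        return NULL;
    }
    return buf;
}
```

In a test like mine, the destination passed to RCCE_recv would come from alloc_line_aligned(3199) rather than from an arbitrary offset into a larger array, and is_line_aligned can be used as a sanity check before timing the transfer.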