If you run the PINGPONG example included with RCCE, the output is essentially the round-trip latency at different message sizes.
The 32-byte result is effectively the "zero-byte" latency, if I understand correctly, since a minimum of 32 bytes is always written. The RCCE creators could confirm this, but in any case I would expect the 32-byte latency to be at least very close to the zero-byte one.
If it's low latency you want, RCCE is the way to go. Running in the default mode on SCC, we see a round-trip latency of 5 microseconds. More importantly, if you look at how RCCE is implemented, plotting round-trip latency against the number of hops across the network gives exactly the slope you'd expect for a 4-cycle transit through each router. See the attached plot for details.
[Attachment: rcce_latency.gif, 30.1 KB]
Oh, thanks for the elaborate answer. I'd really like to reproduce those results! Can I run that PINGPONG test on SCC remotely?
Yes you can.
Pingpong is a demo app shipped with RCCE.
The lower bound of five microseconds is somewhat confusing, because if I try to *roughly* estimate the simplest ping-pong scenario between two cores on the same tile, I can't even get close to one microsecond. Can you please explain that five-microsecond bound?
Hello again Tim,
What am I missing here? A 5 µs latency between the cores sounds like a disaster to me.
That is even worse than a ping-pong over an Ethernet fabric between remote computers.
Are you familiar with any companion libraries for RCCE that improve on this?
I think you may be confusing µsec with milliseconds there... for a local network RTT, 5 ms is quite normal, but 5 µs would be extraordinarily fast, and I don't think even Ethernet has that sort of blazingly fast network! ;-)
I've been able to reproduce the ~5 µs lower bound for the pingpong test; we've seen values around 7.5 µs for 32 bytes of data. For larger message sizes the times increase, intersecting with loopback networking at around a 4K message size, at roughly 100 µs.
Hope this helps!
According to Wikipedia (http://en.wikipedia.org/wiki/Microsecond), a microsecond is one millionth (10^-6) of a second.
The goal is to understand the discrepancy between the 5-microsecond bound of the RCCE ping-pong and the *nanosecond* scale of the hardware accesses:
How do we get from the 100-200 core clocks needed to access the local MPB to the mind-blowingly enormous figure of 5 microseconds?!
If you say ~200 clocks at 533 MHz, doesn't that mean ~0.375 µs for each operation on the other core's MPB or configuration registers (test/set bit)? Or have I scrambled the arithmetic?
That means the 5 µs round trip is the equivalent of only 10-15 such operations for everything, including time for cache invalidates, synchronization, call and return instructions, and miscellaneous flag manipulation.
Doesn't surprise me all that much, particularly for the high-level interface.
But more importantly, it's not a product but an example implementation with all source available at http://marcbug.scc-dc.com/svn/repository/trunk/rcce/
Take a look to see what is happening. It would be great if you would improve or extend it, as the RWTH folks did with non-blocking support and optimized buffer copies (Carsten Clauss and Stefan Lankes' iRCCE, http://communities.intel.com/message/110482#110482).
As for me, the math is fine, but some clarifications about the clock numbers (generously provided by Haas Werner from Intel <firstname.lastname@example.org> on the Barrelfish mailing list) are essential:
"The latency table reflects the numbers from looking at the actual hardware, i.e. without taking software operation into account. The RCCE round-trip times, however, were measured by running an actual application, i.e. they rather reflect the efficiency of one particular communication algorithm than hardware properties. I do not know the precise number but there are actually several MPB accesses involved in passing data via RCCE."
"I consider the numbers as highly trustworthy as we got them through simulating the logic we built. If software is used to derive these results one has to take uncertainties in program execution into account. Even in a BareMetal environment and with this simple in-order core I doubt that you can measure times with single clock cycle precision because there is jitter among pairs of RDTSC if there are outstanding memory operations in the pipeline.
Regarding your latency-related questions, please note that all times in the latency table are measured from the output of the core, i.e. this implies a L1 miss. For memory accesses with the MPBT attribute bit set the L2 cache is transparent and as far I can see in the implementation it does not matter whether the L2 is enabled or not. So the times for your red and orange scenarios are the ones from the table, i.e. either 15 core or 45 core + 8 mesh clock cycles. I never thought about the impact on access latencies if either paging in the MMU or caching in L1 was disabled. Such time savings occur inside the P54C core, i.e. they do not affect the latencies listed in the table."
Furthermore, additional data may be obtained from the latest release of Barrelfish, which supports the SCC chip: www.barrelfish.org.
> But more importantly, it's not a product but an example implementation with all source available at http://marcbug.scc-dc.com/svn/repository/trunk/rcce/
> Take a look, to see what is happening. It would be great if you would improve or extend it as the RWTH folks did with non-blocking support
> and optimized buffer copies (Carsten Clauss and Stefan Lankes' iRCCE http://communities.intel.com/message/110482#110482).
I hope to get to the implementation, and possibly to improvements, in due time as the project I am engaged in progresses.
Thanks for the answer,