What we measured is that the read/write latency of the local MPB is 10 or 11 cycles, and 5 or 6 cycles with the bypass bit enabled (don't use that, though; it is broken). It's not too difficult to measure time in cycles using the rdtsc instruction. We always use this C macro for convenience, with a uint64_t variable as the argument:
/* Macro to read the TSC register */
#define ASM_READ_TSC(X) asm volatile("rdtsc" : "=A" (X))
I don't think multiple read/write requests are possible at the same time, as without the bypass bit the mesh interface/router is the only component talking to the MPB. I assume it simply queues up MPB requests while one is already being serviced, and that they are serviced in the order they arrive at the tile from the network.
I hope this helps you answer your question.
Thanks for the answer. How exactly do you get 10-11 cycles for the read/write latency? According to the SCC Programmers Guide, the read latency of the local MPB for one cache line is 45 core + 8 mesh cycles... compared to 15 core cycles with the bypass bit. Maybe the document is outdated, but what I am actually able to observe (without bypass) is even more than that. What I measure with rdtsc is the time it takes to execute a MOV instruction that internally causes an MPB access. I don't know if I can go more fine-grained than that...
Actually, I just realised that my answer was not entirely what you were looking for. The 10/11-cycle latency is a longer-term average when reading a lot of data from the MPB. It is, of course, split into one high-latency read that brings the cache line from the MPB into L1 and then several low-latency accesses that hit in the L1 cache. So the 45 core + 8 mesh cycles figure is probably the more accurate one.
If you want to measure this, using rdtsc on a single mov instruction won't get you very far. You will mostly be measuring the overhead of rdtsc itself, so you want to measure over at least a few hundred MPB accesses. Of course you could hop from cache line to cache line on every consecutive read, but there is one other pitfall. As the L1 cache is write-back, reading in data from the MPB might cause an eviction when it replaces a dirty line in the cache. That would trigger a write-back of the line to either L2 or, even worse, main memory. To avoid this you can flush the L1 cache before starting your measurement with the wbinvd instruction, but you need to be in kernel mode to execute it...
In fact, that is exactly what I do: I have a loop that repeatedly brings a cache line in from the MPB. Of course, there is loop overhead then, but I assume that is only a few cycles per iteration. However, I did notice some very strange behavior when trying to hop from one line to another. It might be related to what you explained, although it happens only in some very specific scenarios that involve many cores -- that is why I asked about the way the MPB is accessed in the first place. I might need some more help later, when I have tried some more experiments, so I can be more specific. Thanks!
Have you also checked the actual instruction stream the compiler generates, or did you hand-code your benchmark in assembly? If you write in C it is always good to check the generated assembly with the -save-temps compiler flag or the objdump command. You want to make sure the measurement loop operates completely inside the registers and does not incur any additional memory accesses. Also, when measuring read latency from the MPB you want to 'blackhole' the data you read; you don't want it to be stored somewhere, incurring another memory access. (Of course, by memory access I mean either a memory or a cache access; it's not fully predictable.)
Sure, totally valid points. I had already taken all that into account. The OS is another possible source of non-determinism, I might try with baremetal and see if it changes anything.
@darence - now that we have a new ETI FW that works with 184.108.40.206 and your marc system is up and running, have you tried this with baremetal?
I put the new ETI Framework on our local marc101 (which is running 220.127.116.11), and the "are-you-alive" baremetal test works.
tekubasx@marc101:~/ETI$ gcc_scc -o tmpi.scc share/sample_code/tmpi.c
tekubasx@marc101:~/ETI$ sccReset -g
INFO: Welcome to sccReset 1.4.1 (build date Jun 28 2011 - 16:00:14)...
INFO: Applying global software reset to SCC (cores & CRB registers)...
INFO: (Re-)configuring GRB registers...
tekubasx@marc101:~/ETI$ sccBoot -s
INFO: Welcome to sccBoot 1.4.1 (build date Jul 4 2011 - 16:14:13)...
Status: The following cores can be reached with ping (booted): No cores!
tekubasx@marc101:~/ETI$ launcher -z4 tmpi.scc
1 received from 2 300
2 received from 1 300
0 received from 3 300
3 received from 0 300
ETI really stepped up and upgraded their framework. We'd be interested in seeing how the community is using this framework.
Just as a random sidenote, it's not necessary to use sccReset or sccBoot; the launcher can configure and run the chip after sccBmc -i.
It is true that without an operating system there won't be any timer interrupts or scheduling contention, so the distribution will be tighter, but it shouldn't be too hard to throw out the outliers either way. I guess the primary benefit would be from removing any sources of cache contention.
@Ted - I've managed to get the same test running. However, I'm facing some issues when trying to use RCCE. Is anyone out there more familiar with that?
Are you wanting to use RCCE with baremetal? RCCE used to work with baremetal and can again (I don't know exactly what needs to be done, but I do not believe it is extensive).
I don't know of any current docs that describe the timing issues you bring up. I think it's something you have to measure. Do you think the new sccUART with baremetal would help?
You can find a more detailed description of the MPB memory-access latencies in our paper from the last MARC Symposium in Ettlingen: https://idun.informatik.tu-cottbus.de/mcc/wiki/publications The article also contains our measurement results.
Thanks, your paper is an interesting read indeed. However, my original question still stands.
I'll try to be more specific. For example, when writing to the MPB, I would like to know how many tile cycles this actually takes (not from issuing the instruction until its completion, but only the part that actually accesses the MPB). Then, is there a buffer for pending requests on the MPB, or are they simply buffered by the router on the corresponding tile? Some of these questions are very hard to answer experimentally, because it is impossible to come up with tests that are fine-grained enough. On the other hand, understanding this would help us explain the results of some experiments we have already conducted.
I guess there is a document dealing with this type of question (the implementation of the MPB) in more depth, or at least a person who knows how all this works internally. Anyone?
So which part of the time do you consider to be "actually accessing the MPB"? Do you mean the time it takes for the data to propagate from the router into the MPB memory? I agree this is difficult to measure; I can't quite think of a way to do that right now. I'm also not sure why you would need this exact figure; perhaps you can explain the experiment and the results you are trying to figure out.
As for the buffering, data is only buffered in the routers as far as I am aware. So this could be in the local router or one further down the chip if there is a lot of network backpressure. Since the bypass bit does not work, all traffic to/from the MPBs goes through the routers, including traffic from the two local cores. Also, one core can only have a single outstanding read or write request, so no buffering would be needed from that side anyway.
Maybe you want to know how long it takes the destination router to actually write the data into the SRAM? ...without any overheads in cores or network latencies.
In that case you can do a simple experiment: let P cores read/write concurrently N times from/to the same SRAM module and measure the time to complete this (let's call it T). The processing overhead per request at the module cannot be larger than T/N/P. For large N you should get a good impression.
Repeat this with different P from 1 to 48 and plot a curve. There will be a point where T begins to grow much faster. This P is where you actually saturate the SRAM module or destination router.
For example, I got around 5 cycles (measured in core clock) per SRAM read request, and saturation kicked in at 19 cores. For the atomic counters the numbers are more interesting: up to 8 cores can access them without noticing any delay, and the requests take around 33 cycles. Of course, the increment time visible on a core is much larger (240 to 270 cycles).