This is a fairly general question, and I know I am not giving you all the information. I am trying to see whether others experience the same phenomenon when evaluating the SCC speedup factor as an algorithm is spread across more and more cores. You are welcome to ask questions if you think important input is missing. I tried to keep this as general as I can.
I wrote an application that runs on either 1 or 33 cores.
The algorithm uses 33 cores to "parallelize" a function (which runs on a single core in a separate test). The "parallelization" creates 8 independent chains of 4 cores, implementing 8 systolic arrays. The 33rd core loops quickly and sends messages to the 8 chain heads, telling them to start. There is not much to calculate on each core, and I probably pay a performance penalty for managing the MPB pointers. Using 33 cores I expected a speedup of 30.
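In case it helps to picture the topology, here is a minimal sketch of the dispatcher/chain-head interaction. The names, flag layout, and loop body are illustrative placeholders (not my actual code); the RCCE calls are the standard gory-interface ones:

```c
#include "RCCE.h"

#define NUM_CHAINS 8
#define CHAIN_LEN  4
#define DISPATCHER 32

int RCCE_APP(int argc, char **argv)
{
    RCCE_init(&argc, &argv);
    int me = RCCE_ue();

    RCCE_FLAG start;                 /* "start" flag living in each core's MPB */
    RCCE_flag_alloc(&start);

    if (me == DISPATCHER) {
        /* the 33rd core: tell the 8 chain heads (cores 0, 4, ..., 28) to start */
        for (int c = 0; c < NUM_CHAINS; c++)
            RCCE_flag_write(&start, RCCE_FLAG_SET, c * CHAIN_LEN);
    } else if (me % CHAIN_LEN == 0 && me < NUM_CHAINS * CHAIN_LEN) {
        /* chain head: wait for the kick, clear the flag, push work downstream */
        RCCE_wait_until(start, RCCE_FLAG_SET);
        RCCE_flag_write(&start, RCCE_FLAG_UNSET, me);
        /* ... compute, then RCCE_put() the result into core me+1's MPB ... */
    }
    /* cores 1..3 of each chain would similarly wait on their upstream neighbor */

    RCCE_finalize();
    return 0;
}
```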
As a preliminary step, I tried a very similar algorithm on another many-core platform (not the SCC). Indeed I got what I expected (the speedup even went super-linear; there are reasons for that, but they don't matter now). That environment also had no cache coherence.
OK, so I have proof of a good algorithm in my hands: it "parallelizes" well. I migrated the code to the SCC platform (not among the easiest tasks of my life, I must note).
On the SCC platform I got a speedup factor of 1.3.
I wonder why it is so low, so I decided to share the details; I may be doing something wrong, and you may be able to help me.
- I am using the GORY interface. Please assume I am using it "effectively enough" in my code (enough to get a speedup factor of at least 6 with 33 cores).
- I tried playing with the DDR frequency using "sccBmc -i" and saw no performance change. This tells me the algorithm never (or seldom) goes off chip for data during its main loop.
- When I experiment with different tile and mesh frequencies, the results don't change much either (only by a few percent).
- Run time is measured using RCCE_wtime() just before and after the main loop (which has 10 million iterations), and I calculate the speedup factor from the delta between those two readings (see the first sketch after this list). I believe there is no problem with my time measurement, since I get the correct ratio when running 100 iterations versus 10M iterations (i.e., the short run takes 100/10M of the time spent on the algorithm; there are no hidden 10 seconds of setup time, overhead, or whatever else one may suspect).
- 7 or 8 inter-core messages are sent in each iteration.
- Cores poll each other's flags before sending a message, looking for an UNSET flag in the destination core's MPB buffer. There are 6 flags for 6 MPB lines of 32 bytes each. After a message is sent and the destination core fetches it, the destination UNSETs the appropriate flag (see the second sketch after this list). Supporting this mechanism creates heavy overhead. For the discussion I am willing to exaggerate: even if we assume it is 5 times heavier than the algorithm code, I should still expect a speedup factor of 30/5 = 6.
- I am using the RCCE trunk version from September 9th, 2012.
- I don't use SINGLEBITFLAGS.
- I perform a chip reset before testing.
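To illustrate the timing item above, this is roughly what the measurement looks like (ITERATIONS and run_iteration() are placeholders standing in for my real loop body):

```c
#define ITERATIONS 10000000L                 /* 10M in the real runs, 100 for the sanity check */

double t0 = RCCE_wtime();
for (long i = 0; i < ITERATIONS; i++)
    run_iteration();                         /* placeholder: one step of the systolic pipeline */
double elapsed = RCCE_wtime() - t0;
/* speedup = elapsed_on_1_core / elapsed_on_33_cores */
```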
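And this is roughly the shape of the per-line flag handshake described above. Again illustrative: the buffer names, offsets, and helper functions are made up for this post, but the RCCE gory-interface calls are the ones I actually use:

```c
#include "RCCE.h"

/* Sender side: wait for an UNSET flag at the destination, write one 32-byte
   MPB line with RCCE_put(), then SET the flag to signal "line full". */
void send_line(t_vcharp dest_slot, RCCE_FLAG *ready, char *payload, int dest)
{
    RCCE_FLAG_STATUS status;

    do {                                           /* poll the destination's flag */
        RCCE_flag_read(*ready, &status, dest);
    } while (status != RCCE_FLAG_UNSET);

    RCCE_put(dest_slot, (t_vcharp)payload, RCCE_LINE_SIZE, dest);
    RCCE_flag_write(ready, RCCE_FLAG_SET, dest);
}

/* Receiver side: wait for SET, copy the line out of the local MPB slot,
   then UNSET the flag so the sender can reuse the line. */
void recv_line(t_vcharp my_slot, RCCE_FLAG *ready, char *payload, int me)
{
    RCCE_wait_until(*ready, RCCE_FLAG_SET);
    RCCE_get((t_vcharp)payload, my_slot, RCCE_LINE_SIZE, me);
    RCCE_flag_write(ready, RCCE_FLAG_UNSET, me);
}
```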
Can you help me out? What do you believe could be the cause of such a low speedup? Have you ever experienced similar behavior with the SCC?
Thank you for reading and helping!