
    SCC performance

    ohntz

      Hello all,

       

      This is quite a general question, and I know I am not giving you all the information. I am trying to see whether others have experienced the same phenomenon when evaluating the SCC speedup factor as an algorithm is spread over more and more cores. You are welcome to ask questions if you think important input is missing; I tried to keep this as general as I can.

       

      Here goes:

       

      I wrote an application that runs on either 1 or 33 cores.

      The algorithm uses 33 cores to "parallelize" a function (which, in a separate test, runs on a single core). The "parallelization" creates 8 independent chains of 4 cores each, implementing 8 systolic arrays. The 33rd core loops quickly and sends messages to the 8 chain heads, telling them to start. There is not much to compute in each core, and I probably pay a performance price for managing the MPB pointers. Using 33 cores I expected a speedup of 30.
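
      To make the topology concrete, here is a minimal sketch of what the dispatching core might look like. This is not my actual code; it only assumes the documented GORY flag calls (RCCE_flag_alloc / RCCE_flag_write), and the names NUM_CHAINS and chain_head[] are illustrative:

          #include "RCCE.h"

          #define NUM_CHAINS 8                   /* 8 independent 4-core chains            */

          /* illustrative only: ranks of the 8 chain-head cores */
          static int chain_head[NUM_CHAINS] = {0, 4, 8, 12, 16, 20, 24, 28};

          RCCE_FLAG start_flag;                  /* same MPB offset on every core          */

          void dispatcher_loop(long iterations)
          {
              /* called on every core in the same order so the MPB offsets line up */
              RCCE_flag_alloc(&start_flag);

              for (long iter = 0; iter < iterations; iter++) {
                  /* the 33rd core keeps kicking the 8 chain heads; in the real code
                     it would first wait for each head to have cleared its flag     */
                  for (int c = 0; c < NUM_CHAINS; c++)
                      RCCE_flag_write(&start_flag, RCCE_FLAG_SET, chain_head[c]);
              }
          }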

       

      As a preliminary step, I tried a very similar algorithm on another many-core platform (not the SCC). There I indeed got what I expected (the speedup even went super-linear; there are reasons for that, but they don't matter now). That environment was also without cache coherence.

       

      So I have proof of a good algorithm in my hands: it "parallelizes" well. I then migrated the code to the SCC platform (not among the easiest tasks of my life, I must note).

       

      On the SCC platform I got a speedup factor of 1.3.

      I wonder why it is so low, so I decided to share the details; I may be doing something wrong that you can help me find.

       

      Notes:

      • I am using the GORY interface. Please assume I am using it "effectively enough" in my code (enough to get a speedup factor of at least 6, for 33 cores).
      • I tried playing with the DDR frequency using "sccBmc -i" and saw no performance change. This tells me the algorithm never (or rarely) goes off-chip for data during its main loop.
      • When I experiment with different tile and mesh frequencies, the results do not change much either (they vary by a few percent).
      • Run time is measured with RCCE_wtime() called just before and just after the main loop (which has 10 million iterations), and the speedup factor is computed from that delta. I believe there is no problem with my time measurement, since I get the correct ratio when running 100 iterations versus 10M iterations (the 100-iteration run takes 100/10M of the time, so there are no hidden 10 seconds of setup time, overhead, or whatever else one may suspect).
      • 7 or 8 inter-core messages are sent in each iteration.
      • Cores poll each other's flags before sending a message, looking for an UNSET flag in the destination core's MPB. There are 6 flags for 6 MPB lines of 32 bytes each. After a message is sent and the destination core has fetched it, the destination UNSETs the corresponding flag. Supporting this mechanism creates heavy overhead. For the sake of discussion I am willing to exaggerate: even if we assume it is 5 times heavier than the algorithm code, I should still expect a speedup factor of 30/5 = 6. (A sketch of this send path, together with the timing code, appears right after this list.)
      • I am using the RCCE trunk version from September 9th, 2012.
      • I don't use SINGLEBITFLAGS.
      • I perform a chip reset before testing.
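
      To make the mechanism concrete, here is roughly what the timing and one flag-guarded send look like. This is a stripped-down sketch, not my actual code; it assumes the documented GORY calls (RCCE_put / RCCE_flag_read / RCCE_flag_write) and 32-byte MPB lines, and the names msg_line, line_flag and NITER are illustrative:

          #include <stdio.h>
          #include "RCCE.h"

          #define NITER 10000000L                      /* the 10 million main-loop iterations */

          /* one 32-byte MPB line and its flag (illustrative names); both are
             allocated with the GORY interface on every core in the same order  */
          t_vcharp  msg_line;
          RCCE_FLAG line_flag;

          /* send one 32-byte message to core 'dest': poll until its flag is UNSET,
             copy the line into its MPB, then SET the flag                          */
          void send_line(int dest, char *payload)
          {
              RCCE_FLAG_STATUS s;
              do {
                  RCCE_flag_read(line_flag, &s, dest);          /* poll the flag in dest's MPB */
              } while (s != RCCE_FLAG_UNSET);

              RCCE_put(msg_line, (t_vcharp) payload, 32, dest); /* copy one MPB line           */
              RCCE_flag_write(&line_flag, RCCE_FLAG_SET, dest); /* tell dest the line is full  */
          }

          void run_and_time(int next_core)
          {
              char payload[32] = {0};

              double t0 = RCCE_wtime();
              for (long iter = 0; iter < NITER; iter++) {
                  /* ... a small computation, then 7-8 sends like this per iteration ... */
                  send_line(next_core, payload);
              }
              double t1 = RCCE_wtime();

              printf("main loop: %f s\n", t1 - t0);    /* this delta is what I compare  */
          }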

       

      Can you help me out, then? What do you believe could cause such a low speedup? Have you ever experienced similar behavior from the SCC?

       

      Thank you for reading and helping!

      Ohn

        • 1. Re: SCC performance
          saibbot

          It could be due to the messaging latencies: the synchronization may cost more than the parallelization gains.

           

          Is the local computation on the 8 chains long enough to "cover" the messaging? I suppose so, since you achieve the expected speedup on the other platform.

           

          How do the cores within each chain communicate? It would be interesting to measure the messaging latencies you get in the application.

           

          Vasilis.

          • 2. Re: SCC performance
            ohntz

            There are 8 chains, each with 7 computational parts. Each core passes its computational result to the next core in the chain, and the result then travels back to the head of the chain.

            Example:

            A->B->C->D->C->B->A

            I do this to exploit the L1 and L2 caches to the maximum. This is what gives me the super-linear speedup on the other platform.
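
            In pseudo-C, a middle node of a chain does something like the following. All names here are illustrative; recv_line would be the receive-side counterpart of the flag-guarded send (RCCE_get from the core's own MPB line, then UNSET the flag):

                /* illustrative shape of one mid-chain core, e.g. core B in A->B->C->D->C->B->A;
                   its working set stays resident in L1/L2, which is what made the other
                   platform go super-linear                                                      */
                void chain_node_loop(int prev, int next, long iterations)
                {
                    char buf[32];

                    for (long iter = 0; iter < iterations; iter++) {
                        recv_line(prev, buf);        /* wait for the upstream result          */
                        compute_stage_fwd(buf);      /* one of the 7 computational parts      */
                        send_line(next, buf);        /* pass it down the chain                */

                        recv_line(next, buf);        /* the result coming back up the chain   */
                        compute_stage_bwd(buf);      /* another computational part            */
                        send_line(prev, buf);        /* back towards the head of the chain    */
                    }
                }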

             

            The computational part is indeed minor. I am willing to exaggerate and say the synchronization cost is 5 times heavier than the computational part; even then I would not expect a speedup of only 1.3.

             

            In addition, when I try different mesh/tile/DDR frequencies - the result doesn't change much.

             

            How do the cores within each chain communicate?

            - They implement 2 FIFOs inside their MPB. One FIFO for the messaging towards the end of the chain (A->B->C->D), and one FIFO that buffers messages going back to the head of chain (D->C->B->A).

            Each FIFO has 3 entries. For each FIFO I keep a set of pointers telling me where the next incoming message will appear in the MPB, and where to write the next outgoing message in the destination core's MPB. There are also 2 sets of flag pointers. When a core SETs a flag on another core, it means the sender has finished copying the data into the MPB; UNSETTING a flag means the destination core has copied the MPB line into its own local memory. I guess this is all fairly standard.
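
            Schematically, one direction of a chain link looks like this. It is a simplified sketch of the bookkeeping (not the real code, and all names are mine), assuming the documented GORY calls RCCE_put / RCCE_get / RCCE_flag_read / RCCE_flag_write:

                #define FIFO_DEPTH 3                      /* 3 entries of one 32-byte MPB line each  */

                /* one direction of a chain link (e.g. A->B); the lines and flags live in
                   the receiver's MPB, and each side keeps its own index                    */
                typedef struct {
                    t_vcharp  line[FIFO_DEPTH];           /* MPB lines, from RCCE_malloc             */
                    RCCE_FLAG full[FIFO_DEPTH];           /* SET = sender finished copying the data,
                                                             UNSET = receiver has drained the line   */
                    int       wr;                         /* sender: next slot to write              */
                    int       rd;                         /* receiver: next slot to read             */
                } mpb_fifo;

                /* sender side: wait until the next slot is free, copy one line, SET its flag */
                void fifo_push(mpb_fifo *f, int dest, char *msg32)
                {
                    RCCE_FLAG_STATUS s;
                    do { RCCE_flag_read(f->full[f->wr], &s, dest); } while (s != RCCE_FLAG_UNSET);

                    RCCE_put(f->line[f->wr], (t_vcharp) msg32, 32, dest);
                    RCCE_flag_write(&f->full[f->wr], RCCE_FLAG_SET, dest);
                    f->wr = (f->wr + 1) % FIFO_DEPTH;
                }

                /* receiver side: wait until the next slot is full, copy it out, UNSET its flag */
                void fifo_pop(mpb_fifo *f, int me, char *msg32)
                {
                    RCCE_FLAG_STATUS s;
                    do { RCCE_flag_read(f->full[f->rd], &s, me); } while (s != RCCE_FLAG_SET);

                    RCCE_get((t_vcharp) msg32, f->line[f->rd], 32, me);
                    RCCE_flag_write(&f->full[f->rd], RCCE_FLAG_UNSET, me);
                    f->rd = (f->rd + 1) % FIFO_DEPTH;
                }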

            I must say this is a heavy price to pay if the user is interested in performance. FIFOs can be easily implemented in HW and I assume most applications would need them. Just a point for thought.

             

            Nevertheless, I do not believe this is what degrades the speedup so much, and I am looking for ideas, or for people who have seen the same issue.

            • 3. Re: SCC performance
              saibbot

              I am pretty sure the problem is due to communication overheads. I have also had cases where I could not beat a sequential implementation with a parallel one.

               

              If you want, you can send me the parts of your app that implement the communication and I will give them a look.

               

              Vasilis.

              • 4. Re: SCC performance
                ohntz

                Thank you, Vasilis.

                 

                I will send you the files a little later. Can't do that at the moment.

                 

                Meanwhile, let's discuss some more. If you are correct, why, then, does changing the MESH and TILE frequencies not affect the results?

                 

                Ohn

                • 5. Re: SCC performance
                  ohntz

                  I am kind of suggesting that there is a "hidden idle time" that I can't explain. It is not TILE/MESH/DDR frequency dependent.

                  • 6. Re: SCC performance
                    saibbot

                    Possibly because the tile frequency increases along with the mesh frequency, so the bottleneck remains the same.

                     

                    In any case, I think the increased mesh frequency will only help if you have paired calls, e.g., one core busy-waiting to receive while another core is sending. Otherwise, the increased frequency should not affect the results significantly.