5 Replies Latest reply on Sep 6, 2011 12:23 PM by tedk

    Different performance on same tile?

    markus_pm

      Hi,

       

      in my current experiments I am getting results I can't really understand. What I'm trying to do is send a 32-byte message round-trip between core 0 and each other core of the SCC (cores 1..47). My expectation was that the round-trip time would increase from core 1 to core 11, then drop back down to roughly the value of core 2, increase again through cores 12..23, and so on. In general, this is the case. But instead of getting roughly the same values for cores 2 and 3, 4 and 5, etc., the odd-numbered cores are _always_ worse than the even-numbered ones. I attached a figure to illustrate my current results.

       

      The question now is whether this is reasonable behaviour, and if so, what could explain it. Or could a programming mistake produce such "consistent" misbehaviour?

       

      Thanks,

       

      Markus

        • 1. Re: Different performance on same tile?
          jheld

          What are you using to resolve the core number to a system address? RCCE?

          • 2. Re: Different performance on same tile?
            markus_pm

            I'm preloading the LUT entries with the default mapping, where the MPBs are located at addresses 0xC0000000 through 0xD7FFFFFF. I then split each tile's MPB into equal parts, so core 0 starts at offset 0 of the on-tile buffer and core 1 at offset 0x1000. My program assumes this default mapping and splitting and determines the destination's system address accordingly.

             

            What it boils down to: Messaging through the cache line at address 0x1000 seems to be slower.
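
            For clarity, here is a minimal C sketch of that address calculation (a simplified illustration, not the exact code; the 16 MB per-tile spacing is implied by spreading 0xC0000000..0xD7FFFFFF over the 24 tiles, and the names are made up):

            #include <stdint.h>

            /* Destination MPB address for a given core, assuming the default
             * mapping described above: MPBs start at 0xC0000000, each tile
             * occupies one 16 MB LUT slot, and the tile's second core uses
             * offset 0x1000 within the on-tile buffer. */
            #define MPB_BASE        0xC0000000u
            #define MPB_TILE_STRIDE 0x01000000u   /* 384 MB / 24 tiles = 16 MB */
            #define MPB_CORE1_OFF   0x00001000u

            static uint32_t mpb_dest_address(int core_id)
            {
                int tile = core_id / 2;   /* two cores per tile              */
                int z    = core_id % 2;   /* 0 = first core, 1 = second core */
                return MPB_BASE + tile * MPB_TILE_STRIDE + z * MPB_CORE1_OFF;
            }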

            • 3. Re: Different performance on same tile?
              tedk

              Are you using RCCE? Are you using the non-gory (send/recv) or the gory (put/get) versions? If you are using gory, are you using push (put goes to the receiver's MPB) or pull (put goes to the sender's MPB)? RCCE's non-gory send uses pull.

               

              Just to verify, by putting it in different words: what you are seeing is an increase in round-trip time when core 0 exchanges messages with odd-numbered cores. At first look I don't see a reason for this, but I'd like to understand more about how you are numbering the cores and how you are doing the message passing.

              • 4. Re: Different performance on same tile?
                markus_pm

                No, I'm using a very simple self-programmed method to send messages.

                 

                The cores are numbered as follows:

                 

                Core 0 is the first core on tile (x=0,y=0), and core 1 is the second core on that tile. Cores 2 and 3 are the two cores on tile (x=1,y=0), and so on. Cores 46 and 47 are then the cores on tile (x=5,y=3).

                 

                For the messaging itself, the "other" core (not core 0) writes a 32-byte message into the corresponding message buffer; core 0 receives it and sends a message back. As a concrete example:

                 

                - Core 43 writes a cache line to address 0x0 of MPB (x=0,y=0)

                - Core 0 receives the message (it polls for a message) and writes another cache line to address 0x1000 of MPB (x=3,y=3)

                - Core 43 reads that message

                 

                The time for these two messages is measured by Core 43. For each core, I ran 1000000 such tests and took the average round-trip time.
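
                Schematically, the measurement loop on core 43 looks like this (a simplified sketch, not the exact code; the MPB pointers, the cycle counter and the acknowledgement convention are placeholders, and MPB cache-line invalidation is left out):

                #include <stdint.h>

                #define ITERATIONS 1000000
                #define LINE_SIZE  32

                /* Placeholders: mapped MPB pointers and a cycle counter.
                 * mpb_core0 points to offset 0x0 of the MPB on tile (0,0);
                 * mpb_own points to offset 0x1000 of the MPB on tile (3,3). */
                extern volatile uint8_t *mpb_core0;
                extern volatile uint8_t *mpb_own;
                extern uint64_t read_cycles(void);

                uint64_t average_roundtrip(void)
                {
                    uint64_t start = read_cycles();
                    for (uint32_t i = 0; i < ITERATIONS; i++) {
                        /* Write one 32-byte message into core 0's buffer. */
                        for (int b = 0; b < LINE_SIZE; b++)
                            mpb_core0[b] = (uint8_t)(i + 1);

                        /* Core 0 polls its buffer and writes its reply to
                         * offset 0x1000 of this tile's MPB; wait for it here,
                         * assuming it echoes the first byte back as an ack. */
                        while (mpb_own[0] != (uint8_t)(i + 1))
                            ;
                    }
                    return (read_cycles() - start) / ITERATIONS;
                }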

                 

                But I just found a programming bug which seems to have caused the difference. I'll report back if the problem is still not solved.

                • 5. Re: Different performance on same tile?
                  tedk

                  That's how we number cores as well:

                  coreID = (x + 6 * y) * 2 + z, where z is 0 or 1
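
                  As a small C sketch of that mapping and its inverse (illustration only; the function names are made up):

                  /* coreID = (x + 6*y) * 2 + z, with x = 0..5, y = 0..3, z = 0 or 1. */
                  static int core_id(int x, int y, int z)
                  {
                      return (x + 6 * y) * 2 + z;
                  }

                  /* Inverse mapping: recover the tile coordinates and the core
                   * within the tile from the core ID. For example, core 43 gives
                   * z = 1 and tile (x=3, y=3), matching the example above. */
                  static void core_coords(int id, int *x, int *y, int *z)
                  {
                      *z = id % 2;
                      *x = (id / 2) % 6;
                      *y = (id / 2) / 6;
                  }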

                   

                  Your concrete example uses the RCCE "push" model. RCCE started with "push" but then went to "pull", which is: the sender writes into its own MPB, and the receiver takes the message from the sender's MPB. RCCE went to "pull" because of the difficulty of implementing RCCE_recv_test() with "push". I don't have details about why RCCE_recv_test() was difficult with "push".
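
                  Schematically, the two models look like this in C (illustration only, not RCCE code; the slot layout and the single "ready" flag are assumptions):

                  #include <string.h>

                  /* One core's share of an MPB, reduced to a payload and a flag. */
                  typedef struct {
                      volatile char data[32];
                      volatile int  ready;
                  } mpb_slot_t;

                  /* Push: the sender writes into the RECEIVER's MPB slot;
                   * the receiver polls and reads its own slot. */
                  void push_send(mpb_slot_t *receiver_slot, const char *msg)
                  {
                      memcpy((void *)receiver_slot->data, msg, 32);
                      receiver_slot->ready = 1;
                  }
                  void push_recv(mpb_slot_t *own_slot, char *out)
                  {
                      while (!own_slot->ready)
                          ;                                  /* poll own MPB */
                      memcpy(out, (const void *)own_slot->data, 32);
                      own_slot->ready = 0;
                  }

                  /* Pull: the sender writes into its OWN MPB slot;
                   * the receiver fetches the message from the sender's slot. */
                  void pull_send(mpb_slot_t *own_slot, const char *msg)
                  {
                      memcpy((void *)own_slot->data, msg, 32);
                      own_slot->ready = 1;
                  }
                  void pull_recv(mpb_slot_t *sender_slot, char *out)
                  {
                      while (!sender_slot->ready)
                          ;                                  /* poll the sender's MPB */
                      memcpy(out, (const void *)sender_slot->data, 32);
                      sender_slot->ready = 0;
                  }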

                   

                  Note that the RCCE gory shift example uses "push", but if you look inside RCCE_send(), you'll see "pull".