2 Replies Latest reply on Aug 12, 2011 1:36 AM by junghyun

    What is the benefit of WB+MPBT compared to WT+MPBT?

    junghyun

      In order to read from the MPB, we need a CL1INVMB instruction in both cases.

       

      What about writes?

       

      If we set WB+MPBT, we need a CL1INVMB before a write to make sure the data is not in the L1 cache.

       

      However, in the case of WT+MPBT, we don't need a CL1INVMB before a write, because the data will also be written through to the main memory even if it is in the L1 cache.

       

       

      Are there any performance benefits to WB+MPBT?

      Or is WT+MPBT better?

       

       

      What do you think?

        • 1. Re: What is the benefit of WB+MPBT compared to WT+MPBT?
          markus_pm

          My guess would be that WB is in some cases faster than WT. Especially for sequential accesses, the WT overhead might kill performance. But this is definitely something worth evaluating, as guesses don't help very much.

           

          Whether the need for the invalidate has a measurable impact on performance would also be interesting. Since it takes only one cycle, the problem is maybe not the instruction itself but its consequences (i.e., potentially unrelated data being invalidated even though that is not necessary).

           

          Bottom line, I think theory isn't gonna help very much here. You'd need to have some experimental results comparing both strategies and then try to explain them.

          • 2. Re: What is the benefit of WB+MPBT compared to WT+MPBT?
            junghyun

            Thank you for your reply.

             

            Just think about programmability.

             

            This is the implementation of a counter, where array MPB is located in the MPB.

             

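            1) WB+MPBT
            CL1INVMB;           // invalidate before the read
            int tmp = MPB[0];
            CL1INVMB;           // invalidate again before the write
            MPB[0] = tmp+1;

            2) WT+MPBT
            CL1INVMB;           // invalidate before the read only
            MPB[0]++;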

             

             

            WT+MPBT is easier to program and therefore less error-prone.

            With WT+MPBT we only need to insert CL1INVMB before reads, while with WB+MPBT we need to insert it before both reads and writes.

            My opinion is that WT+MPBT is better if there is no difference in performance.

             

             

             

             

            The next question is whether there really is no performance difference between the two policies.

            I ran an experiment with that counter example.

             

            Since the default SCC Linux does not support WT+MPBT, I modified the Linux source code (rckmem.c).

            The iteration count was 100,000. Core 0 was used. Each policy was run 10 times.

             

            1) WB+MPBT : 11,480,525  (average of 10 runs)

            2) WT+MPBT : 9,073,639    (average of 10 runs)

             

            WT+MPBT shows better performance in this experiment (the WB+MPBT figure is about 26% higher).

             

             

             

            In addition, the values might not be propagated to the main memory because of the write-combine buffer.

            So I added flush code. This is the final code.

             

            1) WB+MPBT

            for( i = 0; i < MAX_ITERATION; ++i )
            {
              CL1INVMB;             // invalidate before the read
              int tmp = MPB[0];
              CL1INVMB;             // invalidate before the write
              MPB[0] = tmp+1;

              // flush MPB
              CL1INVMB;             // invalidate before the flush write
              MPB[32] = 0xBADBAD;
            }

             

            2) WT+MPBT

            for( i = 0; i < MAX_ITERATION; ++i )
            {
              CL1INVMB;             // invalidate before the read only
              MPB[0]++;

              // flush MPB
              MPB[32] = 0xBADBAD;
            }

             

            -Result-

            WB+MPBT: 27,120,984 (average of 10 runs)

            WT+MPBT: 20,815,097 (average of 10 runs)

             

            Again, WT+MPBT shows better performance (the WB+MPBT figure is about 30% higher).

             

            Possible reasons for the performance benefit are compiler optimization and cache effects.

            At least two invalidate_mpb() calls per iteration are eliminated with WT+MPBT.

             

             

            In summary, WT+MPBT can perform better and is easier to program.

            This example may somewhat favor WT+MPBT because it continuously updates the MPB.

            Still, I would say WT+MPBT is better.