14 Replies Latest reply on May 16, 2011 11:16 PM by samv

    L1 Cache and MPB behavior

    samv

      Hi,

       

      I'm working on an async (interrupt-driven) extension of RCCE and I'm running into L1 cache flushing issues (to the MPB).

       

      Can you please clarify the following questions to help me understand the behavior of the cache:

       

      1. When is the data from the L1 actually evicted to the MPB?

       

      - RCCE uses RCCE_fool_write_combine_buffer to ensure that the write combining buffer is flushed (to the L1) but how about to the MPB? When does that happen? How can I guarantee that it gets there?

       

      2. Is there any coherence between the L1 caches? What happens if I want to write two adjacent bytes to a core's MPB from two different cores?

       

      - RCCE uses gather flags for barriers (which are adjacent bytes in a cache line). If there is no coherence, and you can't partially flush a cache line to the L1, how can you prevent a race (other than by giving each flag its own cache line)?

       

      Thanks!

      - Sam

        • 1. Re: L1 Cache and MPB behavior
          mwvantol

          samv wrote:

           

          1. When is the data from the L1 actually evicted to the MPB?

           

          - RCCE uses RCCE_fool_write_combine_buffer to ensure that the write combining buffer is flushed (to the L1) but how about to the MPB? When does that happen? How can I guarantee that it gets there?

           

          When you access the MPB through RCCE or through an mmap of /dev/rckmpb, the data is tagged as MPBT and accumulates in the write combine buffer in the memory unit. As soon as a whole cache line has been written, or a write to another cache line happens (i.e. to an address outside the current 32-byte line), the write combine buffer writes its data to the MPB. RCCE_fool_write_combine_buffer exploits exactly this: writing to it touches another cache line, so the buffer is flushed. I suggest you have a look at the SCC Extended Architecture Specification, page 32, section 10.1.2, for more details on L1/L2/MPB and MPBT interaction.
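
          To make the flush conditions concrete, here is a small host-side model of a single 32-byte write combine buffer in C. It is only a sketch under assumptions: the names wcb_t, wcb_flush, wcb_write and demo_fool_write are made up for illustration, the MPB is simulated by a plain array, and only the two conditions mentioned here (line full, or a write to a different line) are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 32u   /* WCB / cache line size in bytes */
#define MEM_SIZE  256u  /* size of the simulated MPB (assumption) */

typedef struct {
    int      valid;            /* is a line currently being combined? */
    uint32_t line;             /* index of that line */
    uint32_t enables;          /* bit i set => byte i of the line was written */
    uint8_t  data[LINE_SIZE];
} wcb_t;

/* Emit only the bytes that were actually written, then empty the buffer. */
void wcb_flush(wcb_t *w, uint8_t *mem)
{
    if (!w->valid)
        return;
    for (uint32_t i = 0; i < LINE_SIZE; i++)
        if (w->enables & (1u << i))
            mem[w->line * LINE_SIZE + i] = w->data[i];
    w->valid = 0;
    w->enables = 0;
}

/* Buffer a one-byte write; a write to a different line flushes first. */
void wcb_write(wcb_t *w, uint8_t *mem, uint32_t addr, uint8_t value)
{
    assert(addr < MEM_SIZE);
    uint32_t line = addr / LINE_SIZE;
    if (w->valid && w->line != line)
        wcb_flush(w, mem);
    w->valid = 1;
    w->line = line;
    w->data[addr % LINE_SIZE] = value;
    w->enables |= 1u << (addr % LINE_SIZE);
    if (w->enables == 0xFFFFFFFFu)   /* whole line written */
        wcb_flush(w, mem);
}

/* Write one byte to line 1; optionally "fool" the WCB with a write to
 * line 0. Returns what the simulated MPB holds at address 40 afterwards. */
uint8_t demo_fool_write(int fool)
{
    uint8_t mem[MEM_SIZE] = {0};
    wcb_t w = {0};
    wcb_write(&w, mem, 40, 'A');     /* sits in the WCB for now */
    if (fool)
        wcb_write(&w, mem, 0, 1);    /* different line: forces the flush */
    return mem[40];
}
```

          In this model demo_fool_write(0) returns 0 because the byte is still sitting in the buffer, while demo_fool_write(1) returns 'A' because the second write targets a different line and pushes the first one out.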

           

          samv wrote:

           

          2. Is there any coherence between the L1 caches? What happens if I want to write two adjacent bytes to a core's MPB from two different cores?

           

          - RCCE uses gather flags for barriers (which are adjacent bytes in a cache line). If there is no coherence, and you can't partially flush a cache line to the L1, how can you prevent a race? (other than to ensure that each flag has its own cache line).

          There is no coherence between L1 caches. However, using the above trick of flushing the write combine buffer, you can write single bytes (or more) to an MPB, which allows updates to adjacent bytes to originate from different cores.

           

          I hope this answers your questions.

          • 2. Re: L1 Cache and MPB behavior
            aprell

            Another important bit is the use of RC_cache_invalidate() to invalidate all L1 cache lines that contain MPB data. A subsequent write will cause a write miss and force the data to be written to the MPB.

             

            There is a related thread:
            http://communities.intel.com/message/121141

            • 3. Re: L1 Cache and MPB behavior
              samv

              Michiel van Tol wrote:

               

              There is no coherence between L1 caches, however, using the above trick of flushing the write combine buffer, you can write single bytes (or more) to an MPB, therefore allowing for updates on adjacent bytes originating from different cores.

               

              I'm not convinced this is actually happening (i.e. flushing to the MPB at byte granularity). This is very atypical for cache design. Flushing is usually done at line granularity (32 bytes).

               

              Currently I'm losing updates to adjacent bytes (originating from different cores). I am performing RC_cache_invalidate() to make sure the core doesn't update a stale cache line. Then I'm writing a single byte. And I'm writing an extra line to fool the write combine buffer to flush the updates.

               

              It would be amazing if someone from Intel can verify the granularity of the L1 flush after fooling the write combining buffer.

               

              Thanks,

               

              Sam

              • 4. Re: L1 Cache and MPB behavior
                mwvantol

                There are some other options as well, for example marking your accesses uncacheable or write-through; then you are sure that you will only write single bytes, words, or dwords. However, how this can be done depends on your situation, as it is controlled by bits on virtual memory pages in the page table. Are you developing under Linux or on bare metal? Are you using RCCE and/or the SCC_API, or are you building your own software stack from the ground up?
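
                For reference, on a 32-bit x86 page table entry these are the PWT (bit 3) and PCD (bit 4) flags. The following is a minimal sketch of the bit manipulation only. The helper names are hypothetical, the SCC-specific MPBT page attribute is deliberately left out, and actually locating and updating the entry (plus flushing the TLB afterwards) depends on your kernel or bare-metal environment.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT (1u << 0)   /* entry maps a page */
#define PTE_PWT     (1u << 3)   /* page-level write-through */
#define PTE_PCD     (1u << 4)   /* page-level cache disable */

/* Return a copy of the entry with write-through caching requested. */
uint32_t pte_set_write_through(uint32_t pte)
{
    assert(pte & PTE_PRESENT);
    return pte | PTE_PWT;
}

/* Return a copy of the entry with caching disabled for the page. */
uint32_t pte_set_uncacheable(uint32_t pte)
{
    assert(pte & PTE_PRESENT);
    return pte | PTE_PCD;
}
```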

                 

                Also, as you are interested in asynchronous messaging using interrupts, which is a field we have been working in as well, I can recommend reading the technical report the Barrelfish project wrote (number 5; it is included in the Barrelfish release) on their efforts to target the SCC. They also required asynchronous messages to implement their form of IPC, and made an interesting analysis.

                • 5. Re: L1 Cache and MPB behavior
                  mwvantol

                  samv wrote:

                   

                  Currently I'm losing updates to adjacent bytes (originating from different cores). I am performing RC_cache_invalidate() to make sure the core doesn't update a stale cache line. Then I'm writing a single byte. And I'm writing an extra line to fool the write combine buffer to flush the updates.

                   

                   

                  Are you actually just writing single bytes to an MPB, or are you performing a read-update-write cycle on these bytes?

                   

                   

                   

                  If you only write, the pseudocode should be:

                   

                   

                   

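                  RC_cache_invalidate();                        /* drop cached MPB lines so the write misses in L1 */
                  <write byte(s)>
                  *(int *)RCCE_fool_write_combine_buffer = 1;   /* write to another line to flush the WCB */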
                  

                   

                   

                  If it is read-update-write, then:

                   

                   

                   

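                  RC_cache_invalidate();                        /* make sure the read fetches fresh data from the MPB */
                  <read byte(s)>
                  RC_cache_invalidate();                        /* drop the line again so the write misses in L1 */
                  <write modified byte(s)>
                  *(int *)RCCE_fool_write_combine_buffer = 1;   /* write to another line to flush the WCB */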
                  

                   

                   

                   


                  It is important to invalidate before the write. Otherwise your write might hit in L1 and not be transferred to the MPB at that moment; the transfer would then happen only later, when the line is replaced in the cache, and at that point the whole line would indeed be written to the MPB. If you invalidate before the write, you make sure that you get a cache miss in L1, and as both the L1 and L2 caches are 'write around' on a miss, not 'allocate on write', the written byte(s) should go directly through to the MPB. However, MPBT-tagged data passes through the write combine buffer and can be held there first, so you have to use the extra 'fool WCB' write to avoid that.
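
                  The effect of the invalidate can be illustrated with a tiny host-side model of a one-line L1 that allocates on reads but not on writes. This is a simulation under simplifying assumptions, not SCC code: l1_line_t, l1_read, l1_invalidate, l1_write and demo_flag_write are hypothetical names, l1_invalidate merely stands in for RC_cache_invalidate(), and the write combine buffer is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32

typedef struct {
    bool    valid;               /* does L1 hold a copy of the line? */
    uint8_t data[LINE_SIZE];
} l1_line_t;

/* Reads allocate: a miss copies the whole line from the MPB into L1. */
uint8_t l1_read(l1_line_t *l1, const uint8_t *mpb, int offset)
{
    assert(offset >= 0 && offset < LINE_SIZE);
    if (!l1->valid) {
        memcpy(l1->data, mpb, LINE_SIZE);
        l1->valid = true;
    }
    return l1->data[offset];
}

/* Stand-in for RC_cache_invalidate(): drop the cached copy. */
void l1_invalidate(l1_line_t *l1)
{
    l1->valid = false;
}

/* Writes do not allocate: a hit updates L1 only, a miss goes around it. */
void l1_write(l1_line_t *l1, uint8_t *mpb, int offset, uint8_t value)
{
    assert(offset >= 0 && offset < LINE_SIZE);
    if (l1->valid)
        l1->data[offset] = value;   /* stays in L1 until the line is evicted */
    else
        mpb[offset] = value;        /* write-around (WCB omitted in this model) */
}

/* Read a flag, optionally invalidate, then write it.
 * Returns the value the simulated MPB holds afterwards. */
uint8_t demo_flag_write(bool invalidate_first)
{
    uint8_t mpb[LINE_SIZE] = {0};
    l1_line_t l1 = { .valid = false };
    (void)l1_read(&l1, mpb, 0);     /* an earlier read pulled the line into L1 */
    if (invalidate_first)
        l1_invalidate(&l1);
    l1_write(&l1, mpb, 0, 1);
    return mpb[0];
}
```

                  Without the invalidate, the write hits the cached copy and the simulated MPB still shows 0; with it, the write misses and the MPB shows 1.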

                   

                  I hope this helps, or is this exactly what you have been doing?

                  • 6. Re: L1 Cache and MPB behavior
                    samv

                    Exactly what I'm doing. Literally calling RCCE_flag_write with SINGLEBITFLAGS undefined (so 1 byte/flag)

                    • 7. Re: L1 Cache and MPB behavior
                      aprell

                      I haven't used the new byte flags in RCCE... Could you try out the attached example? It's a bit of a hack, but should generally work...

                      • 8. Re: L1 Cache and MPB behavior
                        tedk

                        Sam, I don't know what you mean by "ensure that the write combining buffer is flushed (to the L1)". It's the "to the L1" that confuses me.

                         

                        Here's what I think.

                         

                        L1 and L2 are configured as write-back. They are not write-allocate. Sometimes people call a cache that is not write-allocate, write-around. The write combine buffer is 32 bytes. This write combine buffer (WCB) only has meaning for memory typed as MPBT. Non-burst below means you write what you have. You are sending a cache line with byte enables. So you can effectively write a byte this way.

                         

                        So ... start a write

                        Is this an L1 hit?
                           no ... is it MPBT?
                              no ... is it in L2?
                                 no ... write to memory ... non-burst
                                 yes ... write to L2
                              yes ... write to WCB
                                 flush WCB to memory if the WCB is filled (32 bytes)
                                    or the next write is to an address that is not MPBT
                                    or the next write is to MPBT memory that is not consecutive
                           yes ... go to L1
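
                        The same decision tree can be written down as a small C function. This is just a restatement of the flow above for readability; the enum and function names are made up, and the three inputs are assumed to be known to the caller.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DEST_L1,               /* L1 hit: the write stays in L1 */
    DEST_L2,               /* not MPBT, L1 miss, L2 hit */
    DEST_MEMORY_NONBURST,  /* not MPBT, misses both caches */
    DEST_WCB               /* MPBT, L1 miss: goes to the write combine buffer */
} write_dest_t;

/* Where does a write end up first, following the flow sketched above? */
write_dest_t route_write(bool l1_hit, bool is_mpbt, bool l2_hit)
{
    if (l1_hit)
        return DEST_L1;
    if (is_mpbt)
        return DEST_WCB;
    if (l2_hit)
        return DEST_L2;
    return DEST_MEMORY_NONBURST;
}
```

                        Note that the L1 hit check comes first, which is why an MPBT write only reaches the WCB when it misses in L1.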

                        • 9. Re: L1 Cache and MPB behavior
                          samv

                          Hi Ted,

                           

                          This is the path I'm interested in (highlighted in green):

                           

                          Ted Kubaska wrote:

                           

                          Is this an L1 hit?
                             no ... is it MPBT
                                no ... is it in L2
                                   no ... write to memory ... non-burst
                                   yes . write to L2
                                yes ... write to WCB
                                   flush WCB to memory if WCB is filled (32 bytes)
                                      or next write to an address that is not MPBT
                                      or write an MPBT memory that is not consecutive
                             yes .. go to L1

                           

                          I'm curious what is actually flushed to memory in the "flush WCB to memory" step (highlighted in red).

                           

                          Based on the code that Andreas provided, here is my understanding of what is happening on the hardware, please correct as necessary:

                           

                          Assume:

                          t_vcharp p = RCCE_malloc(32);
                          ID = RCCE_ue();
                          

                           

                           

                          RC_cache_invalidate();

                          1. All MPBT lines are invalidated

                           

                           

                          long offset = (long)p - (long)RCCE_comm_buffer[ID];
                          *(char *)(RCCE_comm_buffer[2] + offset) = 'A';
                          

                          2. The entire cache line (from: RCCE_comm_buffer[2] + offset to RCCE_comm_buffer[2] + offset + 32) is brought into the L1 cache (Specifically to the WCB) from the MPB memory

                          3. The first byte of the cache line is modified in the WCB

                           

                           

                          *(int *)RCCE_fool_write_combine_buffer = 1;
                          // where RCCE_fool_write_combine_buffer = RC_COMM_BUFFER_START(RCCE_IAM), i.e. first line of the MPB in a core
                          

                          4. The entire cache line from #2 (32 bytes) is written back to the MPB memory

                           

                           

                          Thanks,

                           

                          Sam

                          • 10. Re: L1 Cache and MPB behavior
                            mwvantol

                            samv wrote:

                            long offset = (long)p - (long)RCCE_comm_buffer[ID];
                            *(char *)(RCCE_comm_buffer[2] + offset) = 'A';

                             

                            The *(char *) should be a *(volatile char *) to make sure that the value gets written to memory (or the MPB in this case) exactly at that point and that the compiler is not allowed to reorder this write. I'm not sure this causes the problem you're experiencing, though.
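
                            As a minimal sketch of what that looks like in C (store_flag and demo_store are hypothetical helpers, and an ordinary array stands in for the MPB so it can run on any host): the volatile qualifier only constrains the compiler, so on the SCC the cache invalidate and the write combine buffer flush discussed in this thread are still needed.

```c
#include <assert.h>
#include <stddef.h>

/* Store one byte through a volatile-qualified pointer so the compiler
 * must emit the store and may not merge it with others or drop it. */
void store_flag(volatile unsigned char *base, size_t offset, unsigned char value)
{
    assert(base != NULL);
    base[offset] = value;
}

/* Host-side usage with an ordinary array standing in for the MPB. */
unsigned char demo_store(void)
{
    unsigned char fake_mpb[32] = {0};
    store_flag(fake_mpb, 5, 'A');
    return fake_mpb[5];
}
```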

                            • 11. Re: L1 Cache and MPB behavior
                              samv

                              That's a good point, but not the issue in this case.

                               

                              At this point, I'm trying to understand whether points 1-4 are correct from a hardware perspective. If they are, there is a race between writes of adjacent flags by different cores (since each flag occupies a single byte and the flush writes 32 bytes).

                              • 12. Re: L1 Cache and MPB behavior
                                jheld

                                Your points 1-4 and the discussion above seem to reflect a confusion about the WCB and its role.

                                It is not part of the cache, it is only combining writes to the same cacheline where possible.

                                It is not involved in reads, so #2 and #3 are wrong.

                                 

                                2. The entire cache line (from: RCCE_comm_buffer[2] + offset to RCCE_comm_buffer[2] + offset + 32) is brought into the L1 cache (Specifically to the WCB) from the MPB memory

                                 

                                WCB is only involved in writes.


                                3. The first byte of the cache line is modified in the WCB

                                Adjacent writes (adjacent in time) to the same cacheline (adjacent in space) are combined in the WCB. Nothing is "modified" in the WCB.

                                 

                                 

                                The WCB only affects writes, buffering writes of less than a full cacheline to speed transfers over the memory bus.

                                Since the P54C only has one WCB, a write to a different cacheline causes the existing line to be emitted immediately and the new write to be buffered.

                                 

                                Nothing is ever read from the WCB or put into the WCB on read from memory.

                                -Jim

                                • 13. Re: L1 Cache and MPB behavior
                                  tedk

                                  long offset = (long)p - (long)RCCE_comm_buffer[ID];

                                  *(char *)(RCCE_comm_buffer[2] + offset) = 'A';

                                   

                                  2. The entire cache line (from: RCCE_comm_buffer[2] + offset to RCCE_comm_buffer[2] + offset + 32) is brought into the L1 cache (Specifically to the WCB) from the MPB memory

                                  3. The first byte of the cache line is modified in the WCB

                                  Sam, why do you think the entire cache line is brought into L1? The L1 is write-back, not write-allocate. So just assigning a value to a memory location does not bring that location into the cache.

                                   

                                  If that RCCE_comm_buffer is not in L1 then you don’t go directly to memory when you write it. You go instead to the WCB. What the WCB wants to do is collect an entire cache line before writing to memory. And it will do that if you are doing consecutive writes.

                                   

                                  But then if you start to write to another location that is not in the cacheline that the WCB is buffering, the partially-filled WCB goes to memory (not the cache). This is a technique that I think happens inside RCCE. RCCE writes to the WCB; the WCB is not filled; but RCCE wants what is in the WCB to go to memory. So RCCE writes junk to a fool buffer (a different cacheline), which forces the WCB to memory. It doesn’t force it to the cache. The WCB is independent of the cache.

                                   

                                  Nothing gets modified in the WCB. Stuff gets put into the WCB. You may be thinking of putting something into the WCB as modifying; but once something is in the WCB, that’s the value that’s there; there is no modification.

                                   

                                  Now I think one of your questions is: exactly what happens in a non-burst mode transfer to memory? I think the burst mode happens when you are writing an entire cache line to memory. If you write out the WCB to memory before it is filled (and this occurs if you start to write to a location in a different cacheline than the one in the WCB, or to non-MPBT memory), this is a non-burst transfer to memory.

                                   

                                  It is true that the result of a non-burst transfer can be a byte transfer to memory. Values are not going to get stuck in the WCB.
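
                                  This is what matters for adjacent flags. As a rough illustration (a host-side simulation, not SCC code; the function name and the two-writer scenario are assumptions made for the example), compare two cores that each set one flag byte in the same line, once by writing back whole stale copies of the line and once by writing only their own byte:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32

/* Two writers each set one flag in the same 32-byte line.
 * Returns how many of the two flags are set in the shared line afterwards. */
int flags_after_two_writers(bool byte_enables)
{
    uint8_t mpb[LINE_SIZE] = {0};
    uint8_t copy0[LINE_SIZE], copy1[LINE_SIZE];

    /* Both writers snapshot the line before either has written. */
    memcpy(copy0, mpb, LINE_SIZE);
    memcpy(copy1, mpb, LINE_SIZE);
    copy0[0] = 1;   /* core 0 sets its flag */
    copy1[1] = 1;   /* core 1 sets the adjacent flag */

    if (byte_enables) {
        /* Only the byte that was written reaches the MPB. */
        mpb[0] = copy0[0];
        mpb[1] = copy1[1];
    } else {
        /* Whole-line write-backs: the second one overwrites core 0's
         * flag with the stale 0 from core 1's snapshot. */
        memcpy(mpb, copy0, LINE_SIZE);
        memcpy(mpb, copy1, LINE_SIZE);
    }
    return mpb[0] + mpb[1];
}
```

                                  With whole-line write-backs only one flag survives; with byte-granular writes both do, which is the case the non-burst WCB transfer gives you.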

                                  • 14. Re: L1 Cache and MPB behavior
                                    samv

                                    Thanks Ted and Jim!

                                     

                                    My main concern was that it was not possible to actually write to the MPB at a byte granularity as cache flushes are usually done at a line granularity.

                                     

                                    But this clears it up - the fool WCB technique actually bypasses the L1 entirely and forces a direct write of the byte to the memory.