9 Replies Latest reply on Aug 2, 2011 5:18 PM by tedk

    Are you using the new L2 cache flush routine?


      If you are, please look at bug 195 http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=195.


      The original flush test routine posted here was flawed, but we think the flush routine itself is still working correctly. We'd be interested to hear if people are using it successfully.

        • 1. Re: Are you using the new L2 cache flush routine?

          I was trying to use it, but it would give me a segmentation fault. However, I tried to do it without having any cacheable shared memory allocated, as I was remapping LUTs myself and simply wanted to flush the L2 cache. I guess the driver didn't like that; it gave me a kernel oops:


          <1>Unable to handle kernel paging request at virtual address 801092e0
          printing eip:
          *pde = 00000000
          Oops: 0000 [#4]
          CPU:    0
          EIP:    0060:[<c01dd266>]    Not tainted VLI
          EFLAGS: 00010282   (2.6.16-Rev377_modified_unchecked #1)
          EIP is at rckdcm_purgeAddress+0x16/0x30
          eax: 000002e0   ebx: 103692e0   ecx: 00000003   edx: 801092e0
          esi: 00000020   edi: 080892e0   ebp: 00000224   esp: cef1ff48
          ds: 007b   es: 007b   ss: 0068
          Process lutmap (pid: 864, threadinfo=cef1e000 task=d103d070)
          Stack: <0>000002e0 c01dd34f 103692e0 00000080 c319b6e0 00000004 cf1fc0c0 080892e0
          cef1ffa4 c01434ea cf1fc0c0 080892e0 00000004 cef1ffa4 cf1fc0c0 fffffff7
          b7f57800 cef1e000 c014360d cf1fc0c0 080892e0 00000004 cef1ffa4 00000000
          Call Trace:
          [<c01dd34f>] rckdcm_write+0xcf/0x100
          [<c01434ea>] vfs_write+0x7a/0xf0
          [<c014360d>] sys_write+0x3d/0x70
          [<c01025c9>] syscall_call+0x7/0xb
          Code: ff ff 5a 59 eb ec eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90 50 8b 54 24 08 b9 03 00 00 00 81 e2 ff ff 10 80 81 ca 00 00 10 80 <8a> 02 81 c2 00 00 01 00 49 88 44 24 03 79 f1 58 c3 89 f6 8d bc


          Instead, I've been flushing my L2 cache with my own routine from userspace. After looking at the rckdcm code, I think mine is more efficient as well: it flushes the cache in about 1.2 million cycles (on a 533 MHz core / 800 MHz network configuration). However, it is difficult to verify that it is 100% correct, as we can't inspect the cache memory.

          • 2. Re: Are you using the new L2 cache flush routine?

            We haven't been successful with a routine in userspace. Can you share yours?


            About the seg fault ... it sounds as if you are not using the correct Linux obj. The one with the working flush routine is not the default.


            You are on 1.4.0, I think. Please look in http://marcbug.scc-dc.com/svn/repository/trunk/CustomSCCLinux/

            The line in rckmem.c must use __PHYSICAL_START to avoid the seg fault:

            245   dummyData = (volatile char*)(__va(__PHYSICAL_START+set));


            We are still looking at an anomaly with the flush routine in rckmem.c. We see about 10 to 20 errors every million times we run it ... as described in the bug. We are currently evaluating a proposed fix for this.

            • 3. Re: Are you using the new L2 cache flush routine?

              Thanks Ted, indeed I was using the default kernel of the 1.4.0 sccKit release. With the kernel you suggested, the DCMflush() routine works perfectly. My userspace routine seems to be a factor of 5 faster; however, I want to test it a bit more thoroughly before sharing it, to be more confident about it.

              • 4. Re: Are you using the new L2 cache flush routine?

                Ted Kubaska wrote:


                We haven't been successful with a routine in userspace. Can you share yours?

                Yes, I have cleaned up my code, put it into separate files, and attached it to this post. The implementation is very simple: I mmap a 256K block from LUT entry 0xff, as I assume that entry is never used after boot (and to guarantee that it is one consecutive block in physical memory). The 'flushing' routine then reads a single byte in each of the 8192 cache lines of the L2 cache, in a very efficient loop that compiles to only a few assembly instructions, even with high (-O3) compiler optimisations, e.g.:



                     pushl     %ebp
                     movl     %esp, %ebp
                     movl     eraseblock, %ecx
                     movl     %ecx, %eax
                     addl     $262144, %eax
                .L7:
                     movb     (%ecx), %dl
                     movb     32(%ecx), %dl
                     movb     64(%ecx), %dl
                     movb     96(%ecx), %dl
                     subl     $-128, %ecx
                     cmpl     %ecx, %eax
                     jne     .L7
                     popl     %ebp
                     ret


                As it is impossible to implement real cache flushing for the L2 cache, I refer to it as erasing the cache - you don't invalidate it, you don't flush it, you simply erase (and evict) anything that is in it and replace it with other data.


                Usage is pretty much self-explanatory; include userspace_flush.h, and compile/link userspace_flush.c with your program. Call init_eraseblock() in the initialisation phase of your program, and call erase_cache() whenever you want to clear your L2 cache. Optionally, if you want to clean up properly you can call cleanup_eraseblock() at the end of your program to unmap the memory again.


                In my measurements (on a 533 MHz core / 800 MHz router configuration), it takes between 1 million (clean) and 1.6 million (dirty) cycles to erase the cache. The RCCE DCMflush() routine took between 2 and 5 million cycles in my tests. It seems the more often you call DCMflush(), the faster it becomes (starting at 5 million, dropping slowly to 2 million), even between program instances and excluding L1 I-cache effects, which made me wonder: does Linux smartly cache or optimise often-called system calls?


                I am interested in any feedback on these routines. It seemed very straightforward to me, so perhaps there is some less-obvious problem that I've missed.

                • 5. Re: Are you using the new L2 cache flush routine?

                  Thanks. We'll try out this code and discuss it with Werner, who wrote the flush routine we are currently using. We also have a couple of other senior people trying to puzzle out the best way to use the L2 on SCC.


                  A matter of definition ... why do you say what you have is not a flush routine? We think of invalidating as not writing anything back to memory, but requiring that when the core does access a location, it goes to memory to get it, not the cache. We think of flushing as moving the contents of the cache out to memory. Why is this evicting of the cache not called flushing?


                  This is what DCMflush() does: replace every location in the cache with something else, thus forcing the contents of the cache back out to memory so other cores can get at the data. This is not the most efficient way to do things, of course, but given the lack of an L2 flush instruction on SCC, it's what we have.


                  And as I said we do see some errors (10 or 20 every million times we run the test) that we are tracking down. We don't think this is a coding error but a result of our need to understand the details of the SCC architecture correctly. We think this may have something to do with L1/L2 interaction.


                  I don't know what you mean by "smartly cache or optimize often-called system calls" in this context.

                  • 6. Re: Are you using the new L2 cache flush routine?

                    Actually, I have been looking at the test code from Marcbug #195 that you referenced above. I cleaned it up considerably and used it to test both my implementation and the DCMflush() routine. However, with both flushing methods I get errors. I've also changed the code to single-writer/multiple-reader, and with more than 2 processes the chance of reading incorrect data increases dramatically. I'm still trying to get to the bottom of this; when I have some news I'll also share it in the Marcbug discussion.

                    • 7. Re: Are you using the new L2 cache flush routine?

                      I have just posted some of my findings and new test code on the Marcbug post. Unfortunately, I have to confirm that my userspace flush posted above _DOES NOT WORK PROPERLY_. In theory it does, but it seems to suffer from exactly the same problems as the DCMflush() implementation, as mentioned in the opening post here.

                      • 8. Re: Are you using the new L2 cache flush routine?

                        I read bug 195. It has not been updated since June 21st.


                        Is it working correctly now?


                        Can I use it?


                        Where is the source?



                        • 9. Re: Are you using the new L2 cache flush routine?

                          There's been much history to this project. We went through a period where it passed initial tests, but when we ran very long tests, we saw a small number of failures. The tests started out on sccKit 1.3.0 but then moved to sccKit 1.4.0. Werner Haas and Michiel van Tol were key contributors to the flush routine and its tests. We were about to declare victory when Vivek Subramanian wrote a test program that reported errors right away.


                          About this time the symposium in Germany was just starting up, and attention shifted to its preparation. We then discovered that 1.4.0 was exhibiting some instability ( http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=264 ). Cores would lose and then regain connectivity, but we could restore stability by disabling the eMAC interface. In the Data Center, we focused on characterizing this issue and disabling eMAC on those systems that needed it. sccKit 1.4.1 (with its two patches) fixed Bug 264. We are now focusing on an issue with ssh that prevents RCK MPI from running successfully.


                          There's been a significant change in SCC Linux in going from 1.4.0 to 1.4.1: the cores now run a modern kernel. Although rckmem.c (which contains the flush routine) is part of the build, no one has tested it so far in this environment. Efforts focused instead on stabilizing that environment.


                          Consequently, testing of the cache flush routine has stalled. If you can help, that would be great.