I was trying to use it, but it would give me a segmentation fault. Admittedly, I tried to do it without having any cacheable shared memory allocated, as I was remapping LUTs myself and simply wanted to flush the L2 cache. I guess the driver didn't like that; it gave me a kernel oops:
<1>Unable to handle kernel paging request at virtual address 801092e0
*pde = 00000000
Oops: 0000 [#4]
EIP: 0060:[<c01dd266>] Not tainted VLI
EFLAGS: 00010282 (2.6.16-Rev377_modified_unchecked #1)
EIP is at rckdcm_purgeAddress+0x16/0x30
eax: 000002e0 ebx: 103692e0 ecx: 00000003 edx: 801092e0
esi: 00000020 edi: 080892e0 ebp: 00000224 esp: cef1ff48
ds: 007b es: 007b ss: 0068
Process lutmap (pid: 864, threadinfo=cef1e000 task=d103d070)
Stack: <0>000002e0 c01dd34f 103692e0 00000080 c319b6e0 00000004 cf1fc0c0 080892e0
cef1ffa4 c01434ea cf1fc0c0 080892e0 00000004 cef1ffa4 cf1fc0c0 fffffff7
b7f57800 cef1e000 c014360d cf1fc0c0 080892e0 00000004 cef1ffa4 00000000
Code: ff ff 5a 59 eb ec eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90 50 8b 54 24 08 b9 03 00 00 00 81 e2 ff ff 10 80 81 ca 00 00 10 80 <8a> 02 81 c2 00 00 01 00 49 88 44 24 03 79 f1 58 c3 89 f6 8d bc
Instead, I've been flushing my L2 cache with my own routine from userspace. After looking at the rckdcm code, I think mine is more efficient as well. It flushes the cache in about 1.2 million cycles (on a 533 MHz core / 800 MHz network configuration). However, it is difficult to verify that it is 100% correct, as we can't inspect the cache memory.
We haven't been successful with a routine in userspace. Can you share yours?
About the seg fault ... it just sounds as if you are not using the correct Linux obj. The one with the working flush routine is not the default.
You are on 1.4.0, I think. Please look in http://marcbug.scc-dc.com/svn/repository/trunk/CustomSCCLinux/
The line in rckmem.c must have PHYSICAL_START to avoid the seg fault.
245 dummyData = (volatile char*)(__va(__PHYSICAL_START+set));
We are still looking at an anomaly with the flush routine in rckmem.c. We see about 10 to 20 errors every million times we run it ... as described in the bug. We are currently evaluating a proposed fix for this.
Ted Kubaska wrote:
We haven't been successful with a routine in userspace. Can you share yours?
Yes, I have cleaned up my code now, put it into separate files and attached it to this post. The implementation is very simple: I mmap a block of 256K from LUT entry 0xff, as I assume that it is never used after boot (and to guarantee that it is one consecutive block in physical memory). The 'flushing' routine then reads a single byte in each of the 8192 cache lines in the L2 cache in a very efficient loop that compiles to only a few assembly instructions, and which works even with high (-O3) compiler optimisations, e.g.:
erase_cache:
        pushl   %ebp
        movl    %esp, %ebp
        movl    eraseblock, %ecx
        movl    %ecx, %eax
        addl    $262144, %eax
.L7:
        movb    (%ecx), %dl
        movb    32(%ecx), %dl
        movb    64(%ecx), %dl
        movb    96(%ecx), %dl
        subl    $-128, %ecx
        cmpl    %ecx, %eax
        jne     .L7
        popl    %ebp
        ret
As it is impossible to implement real cache flushing for the L2 cache, I refer to it as erasing the cache - you don't invalidate it, you don't flush it, you simply erase (and evict) anything that is in it and replace it with other data.
Usage is pretty much self-explanatory; include userspace_flush.h, and compile/link userspace_flush.c with your program. Call init_eraseblock() in the initialisation phase of your program, and call erase_cache() whenever you want to clear your L2 cache. Optionally, if you want to clean up properly you can call cleanup_eraseblock() at the end of your program to unmap the memory again.
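The attachment itself is not reproduced here, so the following is only a minimal sketch of what such a routine might look like, not the posted code. The function names come from the usage description above; everything else is an assumption. In particular, an ordinary heap buffer stands in for the 256K block mmap'ed from LUT entry 0xff (a heap buffer is not guaranteed to be contiguous in physical memory, which is the reason the post gives for using the LUT mapping), the constants L2_SIZE and LINE_SIZE are hypothetical names, and erase_cache() returns the number of lines it touched purely so the loop can be checked.

```c
#include <stddef.h>
#include <stdlib.h>

#define L2_SIZE   (256 * 1024)  /* 256 KiB block, as stated in the post */
#define LINE_SIZE 32            /* 262144 bytes / 8192 lines = 32 bytes per line */

/* Block that gets swept; volatile so the reads survive -O3. */
static volatile unsigned char *eraseblock;

/* Stand-in for the real initialisation (assumption): the post mmaps
 * 256K from LUT entry 0xff; here a zeroed heap buffer is used instead. */
int init_eraseblock(void)
{
    eraseblock = calloc(L2_SIZE, 1);
    return eraseblock ? 0 : -1;
}

/* Read one byte from every cache-line-sized slot of the block.
 * Returns the number of lines touched (added for checking only). */
size_t erase_cache(void)
{
    size_t lines = 0;
    for (size_t off = 0; off < L2_SIZE; off += LINE_SIZE) {
        unsigned char sink = eraseblock[off];  /* volatile read */
        (void)sink;
        lines++;
    }
    return lines;
}

/* Release the block again. */
void cleanup_eraseblock(void)
{
    free((void *)eraseblock);
    eraseblock = NULL;
}
```

With these definitions a program would call init_eraseblock() once at start-up, erase_cache() whenever the L2 contents need to be pushed out, and cleanup_eraseblock() before exiting.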
In my measurements (on a 533 MHz core / 800 MHz router configuration), it takes between 1 million (clean) and 1.6 million (dirty) cycles to erase the cache. The RCCE DCMflush() routine took between 2 and 5 million cycles in my tests. It seems that the more often you call DCMflush(), the faster it becomes (starting at 5 million and dropping slowly to 2 million). This holds even between program instances, excluding L1 I-cache effects, which made me wonder: does Linux smartly cache or optimise often called system calls?
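To put those cycle counts in wall-clock terms, here is a small conversion helper. It is only arithmetic on the numbers quoted above, and the function name cycles_to_ms is made up for this illustration. At 533 MHz, 1 million cycles is roughly 1.9 ms and 1.6 million cycles is roughly 3.0 ms, while the 2 to 5 million cycles quoted for DCMflush() correspond to roughly 3.8 to 9.4 ms.

```c
/* Convert a core cycle count to milliseconds for a clock given in MHz.
 * cycles / (MHz * 1e6) is seconds, so cycles / (MHz * 1e3) is milliseconds. */
double cycles_to_ms(double cycles, double core_mhz)
{
    return cycles / (core_mhz * 1000.0);
}
```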
I am interested in any feedback on these routines. It seemed very straightforward to me, so perhaps there is some less-obvious problem that I've missed.
Thanks. We'll try out this code and discuss it with Werner, who wrote the flush routine we are currently using. We also have a couple of other senior people trying to puzzle out the best way to use L2 on SCC.
A matter of definition ... why do you say what you have is not a flush routine? We think of invalidating as not writing anything back to memory, but requiring that when the core does access a location it goes to memory to get it, not the cache. We think of flushing as moving the contents of the cache out to memory. Why is this evicting of the cache not called flushing?
This is what DCMflush() does ... replace every location in the cache with something else thus forcing the contents of the cache back out to memory so other cores can get at the data. This is not the most efficient way to do things of course, but given the lack of an L2 flush instruction on SCC, it's what we have.
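The distinction between invalidating and flushing by eviction can be made concrete with a toy model. The sketch below is not SCC code and says nothing about the real L2 organisation or replacement policy: it assumes a tiny direct-mapped write-back cache with 8 one-byte lines in front of a 64-byte memory, and every name in it is invented for the illustration. Sweeping a scratch region replaces every line and so forces dirty data back to memory, whereas invalidating simply drops it.

```c
#include <stdint.h>

#define NUM_LINES 8   /* toy cache: 8 direct-mapped, one-byte lines */
#define MEM_BYTES 64  /* toy backing memory */

struct line { int valid, dirty; unsigned tag; uint8_t data; };

static struct line cache[NUM_LINES];
static uint8_t memory[MEM_BYTES];

/* Bring addr into the cache, writing back a dirty victim first. */
static struct line *lookup(unsigned addr)
{
    unsigned index = addr % NUM_LINES;
    unsigned tag = addr / NUM_LINES;
    struct line *l = &cache[index];
    if (!l->valid || l->tag != tag) {
        if (l->valid && l->dirty)
            memory[l->tag * NUM_LINES + index] = l->data;  /* write-back */
        l->valid = 1;
        l->dirty = 0;
        l->tag = tag;
        l->data = memory[addr];
    }
    return l;
}

void cached_write(unsigned addr, uint8_t v)
{
    struct line *l = lookup(addr);
    l->data = v;
    l->dirty = 1;  /* memory is now stale until the line is evicted */
}

uint8_t cached_read(unsigned addr) { return lookup(addr)->data; }

/* What another core reading memory directly would see. */
uint8_t memory_peek(unsigned addr) { return memory[addr]; }

/* Flush by eviction: read one address per cache line from a scratch
 * region that holds no live data, replacing every line in the cache. */
void evict_all(unsigned scratch_base)
{
    for (unsigned i = 0; i < NUM_LINES; i++)
        (void)cached_read(scratch_base + i);
}

/* Invalidate: drop every line without writing anything back. */
void invalidate_all(void)
{
    for (unsigned i = 0; i < NUM_LINES; i++)
        cache[i].valid = 0;
}
```

In this model, a value written with cached_write() stays invisible to memory_peek() until evict_all() is run over addresses with a different tag (for example the top 8 bytes of the toy memory), and it is lost altogether if invalidate_all() is called first.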
And as I said we do see some errors (10 or 20 every million times we run the test) that we are tracking down. We don't think this is a coding error but a result of our need to understand the details of the SCC architecture correctly. We think this may have something to do with L1/L2 interaction.
I don't know what you mean by "smartly cache or optimise often called system calls" in this context.
Actually I have been looking at the test code from the Marcbug #195 that you've referenced above. I cleaned it up considerably and used it to test both my implementation and the DCMflush() routine. However, with both flushing methods I get errors. I've also changed the code to single-writer multiple-reader, and with more than 2 processes the chance of reading incorrect data increases dramatically. I'm still trying to get to the bottom of this; when I have some news I'll also share it on the Marcbug discussion.
I have just posted some of my findings and new test code on the Marcbug post. Unfortunately I have to confirm that my userspace flush posted above _DOES NOT WORK PROPERLY_. In theory it does, but it seems to suffer from exactly the same problems as the DCMflush() implementation, which is mentioned in the opening post here.
There's been much history to this project. We went through a period where it passed initial tests, but if we ran very long tests, we saw a small number of failures. The tests started out on sccKit 1.3.0, but then moved to sccKit 1.4.0. Werner Haas and Michiel van Tol were key contributors to the flush routine and its tests. We were about to declare victory when Vivek Subramanian wrote a test program that reported errors right away.
About this time the symposium in Germany was just starting up, and attention shifted to its preparation. We then discovered that 1.4.0 was exhibiting some instability ( http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=264 ). Cores would lose and then regain connectivity, but we could restore stability by disabling the eMAC interface. In the Data Center, we focused on characterizing this issue and disabling eMAC on those systems that needed it. sccKit 1.4.1 (with its two patches, it's called 184.108.40.206) fixed Bug 264. We are now focusing on an issue with ssh on 220.127.116.11 that prevents RCK MPI from running successfully.
There's been a significant change in SCC Linux when going to 1.4.1 from 1.4.0. The cores now run a modern kernel (18.104.22.168). Although rckmem.c (which contains the flush routine) is part of the 22.214.171.124 build, no one has tested it so far in this environment. Efforts focused instead on stabilizing that environment.
Consequently, testing of the cache flush routine stalled. If you can help, that would be great.