I know about the CL1INVMB instruction. CL1INVMB invalidates only the L1 cache lines of MPBT type.
My question was whether an SCC core can invalidate a specific cache line in the L1 and L2 caches. For example, if I want to flush the cache lines corresponding to address 0x1000, how can I do that? I need to invalidate only some lines in L2, not all cache lines.
Can you tell me what "an appropriate pattern of read" is?
I read 256KB, 512KB, and 1MB of data, but the entire cache is not evicted.
The read pattern I used is attached.
The last read from core 1 should be all '7', but some of the data are '2'.
I think this happens because some of core 0's cached data are not successfully evicted.
Since the L2 cache has a pseudo-LRU replacement policy, that could happen.
So my question is: what is a reliable method for replacing all of the cache data?
cache access.png 63.9 K
The L2 cache is 4-way set associative and write-back. Its replacement policy is pseudo-LRU. The policy is described in the file "How the SCC L2 cache Works." This pseudo-LRU policy is not random, so you should be able to evict all the data in the core's cache by reading an amount of data equal to the size of the cache.
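As a minimal sketch of that flush-by-read approach (the function name is mine, the 32-byte line size is an assumption, and the buffer must be physically contiguous on a physically indexed cache, a point that comes up later in this thread):

```c
#include <stddef.h>
#include <stdint.h>

#define L2_SIZE   (256 * 1024)  /* SCC per-core L2: 256 KB */
#define LINE_SIZE 32            /* assumed 32-byte cache lines */

/* Hypothetical flush-by-read: touch one byte of every cache line in a
 * buffer as large as the L2. With a non-random replacement policy this
 * should evict everything previously resident. The returned checksum
 * keeps the compiler from optimizing the reads away. */
static uint32_t flush_l2_by_read(volatile const uint8_t *buf)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < L2_SIZE; i += LINE_SIZE)
        sum += buf[i];
    return sum;
}
```

This is only a sketch of the eviction idea; whether one pass really suffices depends on the pseudo-LRU details discussed in this thread.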
Other users, and internal Intel engineers as well, have expressed interest in a reliable way of evicting the L2. We will continue to look into that. The current shared-memory implementation on the SCC is uncached. Uncached shared memory is useful for proof-of-concept work, but cacheable shared memory is necessary for reasonable performance.
Please continue to look into this issue with us and share your results. Is working with an enabled L2 essential to your research?
Thank you for providing detailed information about the L2 cache.
I read the document, and according to it, reading 256KB of data sequentially should flush all the cache lines.
Are the LRU bits of each set changed only when a replacement occurs?
If so, reading a contiguous 256KB should evict all the cache lines. (The read unit -- 1B, 2B, or 4B -- should not matter.)
Does anything happen to the LRU bits when a cache hit occurs?
P.S. I found some errata.
"in the core's lookup table(LYT)" on the second page.
In the figures, you wrote "LRU" bits in the core address.
I think it should be the LUT index. Why do you use the term "LRU" for it?
Yes, I meant LUT index ... sorry ... I'll update the document. The operation is correct as described, but when you convert the core address into a system address you are indexing into the LUT, and the correct name for the field is LUT(8).
I don't know what you mean by SCC_SVM=1 /shared/junghyun/pthread-test.
Yes, reading in 256K should evict all cache lines for a core. This is the conclusion of our internal discussion. I do not know why it does not work for you. However, we have not actually tried it out ourselves. We will try it out, but I wanted to get this preliminary information to you as quickly as possible. We do not have the resources to write a cache flush routine immediately, but there is interest here in doing such a program.
You can see my code in marc016.
check out /home/junghyun/SCCTest/L2_flush
How to run?
log in to the console of each of rck00 and rck01
rck01:~> L2_flush 8
rck00:~> L2_flush 8
then it will print some strings.
If it prints the string below, it has failed to flush all the cache lines:
got wrong value.. data_array.word = 6
It does not print the string every time, but I get it sometimes.
Please investigate my code and figure out what the problem is.
One more thing: there is no interface to map a memory region above 0x14000000 with a WB (write-back) page, so I made /dev/rckmem.
So, if you want to run it, first boot with the rcklinux.obj in the L2_flush directory.
It's not easy to explain all the code here; please just check out my code, run it, and tell me what the problem is.
I don't know what you mean by "no interface to map memory region above 0x14000000 with WB page" ... we should not need a special SCC Linux to test flushing L2. I understand that you might want a special Linux for your own research, but just to check that we can flush L2, that should not be necessary.
Yes, that could be. However, as you said, the L2 cache is physically indexed.
Then, if you want to flush the L2, you must make sure you read a physically contiguous 256KB.
If you just declare a big array or call malloc, you cannot guarantee the memory area is physically contiguous. That only guarantees the memory area is virtually contiguous.
If the memory area is not physically contiguous, then some data in the array map to the same L2 index. That means there will be cache sets that are never evicted (never even accessed).
For example, if you malloc 8KB of memory, the kernel gives you two pages. Which physical page you get is only determined when you actually write to the page. Let's assume the physical addresses of the two pages are 0x4000 and 0x14000, respectively. The L2 index of 0x4000 is 0x200, and the L2 index of 0x14000 is also 0x200.
Then the two virtually contiguous pages access the same sets of the L2 cache. That could be a problem.
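To make the aliasing concrete, here is a small sketch of the index computation (the 32-byte line size is an assumption; the 2048-set geometry follows from 256KB with 4 ways; the function name is mine):

```c
#include <stdint.h>

#define LINE_SIZE 32                                /* assumed line size */
#define NUM_SETS  ((256 * 1024) / (4 * LINE_SIZE))  /* 256KB, 4-way -> 2048 sets */

/* Set index of a physical address in a physically indexed L2. */
static uint32_t l2_set_index(uint32_t phys_addr)
{
    return (phys_addr / LINE_SIZE) % NUM_SETS;
}
```

With these parameters, l2_set_index(0x4000) and l2_set_index(0x14000) both come out to 0x200, so the two pages in the example contend for the same sets while other sets are never touched.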
So, you must make sure the virtually contiguous pages are physically contiguous, or at least that the 256KB virtual memory area is spread over all of the L2 cache sets. I think the first option is easier.
Then, what should we do? Mmapping the area is the easiest way to do this. I tried to find a way to do it with the existing Linux system, but I could not find one; please let me know if you know of one. I tried mmap with /dev/mem, but it failed because the area was UNCACHEABLE (reading 256KB twice took almost exactly twice as long as reading it once, so nothing was being cached). So I just added another device, /dev/rckmem, for doing this.
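The mapping step could look something like this sketch. The map_region helper is hypothetical; the idea that a custom /dev/rckmem presents its pages as write-back cacheable (where /dev/mem gives uncacheable ones) is from this thread:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map len bytes starting at offset off of a memory
 * device (e.g. /dev/mem, or a custom /dev/rckmem whose pages are
 * write-back cacheable). Returns NULL on failure. */
static void *map_region(const char *dev, off_t off, size_t len)
{
    int fd = open(dev, O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    close(fd);                      /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}
```

Whether the resulting pages are cached write-back is decided by the driver's mmap implementation, not by this call; that is exactly why a custom device is needed here.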
Does it make sense to you? If you know another idea to flush L2, then please share the idea!
Small detail: you do not need to read 256KB of contiguous memory to flush the entire L2; it is enough to read 256KB/W of memory W times, where W is less than or equal to the number of ways (associativity) of the cache and 256KB mod W = 0. If my conjecture is correct that the replacement policy is LRA (Least Recently Allocated) instead of LRU, then doing the reads as I just indicated will flush the L2. However, if you are not sure of the number of ways, then allocating 256KB contiguously is easier.
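One reading of this suggestion, as a sketch only: assume it means W distinct physically contiguous chunks of 256KB/W each (re-reading the exact same chunk would simply hit in the cache and allocate nothing new), with W = 4 ways and 32-byte lines assumed:

```c
#include <stddef.h>
#include <stdint.h>

#define L2_SIZE    (256 * 1024)
#define L2_WAYS    4
#define LINE_SIZE  32
#define CHUNK_SIZE (L2_SIZE / L2_WAYS)   /* 64 KB: spans every set once */

/* Read W chunks, each CHUNK_SIZE bytes and physically contiguous. Each
 * chunk touches every set exactly once; under an LRA policy, W such
 * passes allocate into all W ways, evicting everything previously
 * resident. The checksum keeps the reads from being optimized away. */
static uint32_t flush_by_chunks(const volatile uint8_t *chunks[L2_WAYS])
{
    uint32_t sum = 0;
    for (int w = 0; w < L2_WAYS; w++)
        for (size_t i = 0; i < CHUNK_SIZE; i += LINE_SIZE)
            sum += chunks[w][i];
    return sum;
}
```

The practical advantage is that each chunk only needs 64KB of physical contiguity instead of 256KB, which is easier to obtain from the kernel.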
I see some risky things in your code, which could be avoided if you used RCCE and rccerun to run your experiments. It is not certain that these caused the behavior you saw, but it would be good to rule them out.
1. The contents of the MPB are not guaranteed to be clean. In fact, if you do not explicitly wipe it before a run, old data will stay in it. rccerun clears the MPBs before a run. If you run your code twice without cleaning, the MPB will still contain "value" the second time around, so core 1 will start checking whether the flush worked right away and is virtually guaranteed to find errors.
2. Core 0 does not invalidate the MPB before it writes "value" to the MPB. This write is then not guaranteed to reach memory.
3. I could not confirm correctness of your scheme to allocate off-chip cacheable shared memory. It may well be correct, but it would take too much time for me to check.
4. You do not ensure the core that reads the shared data and checks for correctness has also flushed its cache first--almost a chicken and egg situation, but not quite.
Here is my suggestion. You don't need to do this at all, but if you do, I can help you trace the error--or confirm that the flush actually does work.
1. Use RCCE (RCCE_shmalloc) and rccerun to implement your test and to run it. You would need to select device /dev/mem instead of /dev/rckncm for the shared memory allocation to make sure it is cacheable.
2. Use RCCE_send/RCCE_recv for synchronization of the cores ("wake up core 1").
3. Flush the reading core's cache before it checks the shared data.
A simple sanity check would be to first check for correctness with /dev/rckncm (the default for RCCE_shmalloc).