Thanks. What version of RCCE are you using? Did you take your RCCE from our SVN trunk or are you using a released tag?
Currently, shared memory on the SCC suffers from not being cacheable. Making it cacheable has proved difficult because users have so far not been successful in flushing the L2 cache. It should be possible to reliably flush L2.
I'm using the code from the trunk compiled with -DSHMADD. By the way, if we don't compile with -DSHMADD, shouldn't we be adding an offset to the beginning of shared memory to avoid clashing with the kernel? I'm thinking of something like RCCE_shmalloc_init(RC_SHM_BUFFER_START() + 0x2C0000, RCCE_SHM_SIZE_MAX) for example. Could you take a look at the corresponding code in RCCE_admin.c?
Yes. If you do not use -DSHMADD and you access shared memory, you run the risk of clashing with the kernel.
The post http://communities.intel.com/message/101172#101172 by Michael Riepen specifies exactly what shared memory is used by the system.
You should be able to avoid this clash by changing line 330 in RCCE_admin.c to
RCCE_shmalloc_init(RC_SHM_BUFFER_START() + 0x2C0000,RCCE_SHM_SIZE_MAX);
as you pointed out. But I haven't tested this. I'll run a couple of simple tests and then change the trunk as you suggested.
Note that without SHMADD, you are limited to 64MB of shared memory, and with the new start you suggest, this limit is actually 64MB-0x2C0000.
Note also that the code under SHMADD is still in the trunk and not part of a released tag. We became aware of the problem you pointed out after the latest RCCE release. Code in the trunk is the very latest we make available; it has not yet gone through enough evaluation and testing to be tagged as a release. There's no guarantee that what appears in the trunk will also appear in a release.
It seems to me that the only way to safely use shared memory is to avoid the first 64MB altogether. If I'm reading Michael's post correctly, no offset less than 0x2C0000 can really guarantee that we won't end up writing to memory that is already in use by the system. So setting up more than 64MB of shared memory seems like the best option.
I think you can safely use shared memory in the lower 64MB if you ensure that you skip over the areas used by the system by changing the start address. I like to just ignore the first 64MB because it is easier, but that is not necessary. Unfortunately, the memory used by the system is not contiguous. But you can set the start of your shared memory to be greater than the largest shared memory address used by the system.
The format in Michael's post is "allocated memory @ offset". In his post, shared memory is identified in LUT slots 0x80 through 0x83.
The range for a is 1540KB; the range for b is 768KB, and that for c is 384KB.
The "@ 0MB" for a means that the lower 24 bits of the physical address are 0.
The "@ 2MB" for b means that the lower 24 bits are 0x200000 = 2^21 = 2MB.
For c, the lower 24 bits are 0x190000, so the offset is "@ 1600KB".
a. SHM TTY1 & Perfmeter    8000 0000 - 8018 0FFF (1540KB @ 0MB)
   SHM TTY2                8100 0000 - 8118 0FFF (1540KB @ 0MB)
   SHM TTY3                8200 0000 - 8218 0FFF (1540KB @ 0MB)
   SHM TTY4                8300 0000 - 8318 0FFF (1540KB @ 0MB)
b. rckpc (host network)    8020 0000 - 802B FFFF (768KB @ 2MB)
                           8120 0000 - 812B FFFF (768KB @ 2MB)
                           8220 0000 - 822B FFFF (768KB @ 2MB)
                           8320 0000 - 832B FFFF (768KB @ 2MB)
c. rckmb (on-chip network) 8019 0000 - 801E FFFF (384KB @ 1600KB)
So when we look for shared memory taken from LUT slot 0x80 (it's the same for 0x81, 0x82, and 0x83, because it's shared), the memory taken by the system looks like the following ...
     0KB - 1540KB   a  (1540KB)
  1540KB - 1600KB   unused
  1600KB - 1984KB   c  (384KB)
  1984KB - 2048KB   unused
  2048KB - 2816KB   b  (768KB)
And then if you offset the start of your shared memory by 2816KB, i.e. 0x2C0000, you should avoid the memory used by the system.
Currently I am working with the external shared memory on the SCC, and I was wondering why the processes all get a different start address for the external SHM when allocating data on it with RCCE_shmalloc. At first I thought the external SHM is addressed physically by the RCCE library, and that access to it could be accomplished by simply passing the proper pointer to the UEs... but that doesn't work. I have to pass offsets around among the UEs. This matters because of my program structure: just one process manages the external SHM. I have read the RCCE specification, so I knew about the offset approach. I would be very thankful if someone could explain why the pointer approach can't be used. As I mentioned above, I thought the external SHM is directly addressable by physical addresses, but that can't be the case.
Any help would be appreciated.
You can't use the pointer approach, because the virtual address is not valid on other cores. The SCC is like a cluster on a chip and does not follow a real shared memory model. That means shared memory is mapped into every process on the system at a different virtual address. So you need to use cluster programming techniques to pass data from one core to another.
thank you for your answer.
I am still interested in the details of why no physical addressing scheme is used, or rather, why it can't be used.
If you check the "SCC External Architecture Specification (EAS) ver. 1.1" document, chapter 11.1 "Lookup Table Defaults", or read Ted's 5th post in this topic, you will see that the LUT address space for the off-chip shared memory is identified in LUT slots 0x80 through 0x83.
Ok, let me explain it another way. I thought a process running on a core could do a memory access directly through the LUT, supported by RCCE. But I think the operating system would restrict this by "catching" memory accesses that are not within the process's address space. I somehow assumed that the modified Linux kernel running on the SCC accounted for this, so that the cores could access the off-chip SHM slots directly via the physical addresses in the LUT. This was just an assumption I had about how the off-chip SHM is accessed.
I'm not sure I can answer the "why" from a hardware design perspective. The fact is that every process has to map the memory into its address space, in order to share it, and these mappings may start at different virtual addresses, like Hayder said above. If you pass a pointer from core A to core B, chances are very good that it's invalid. mmap allows you to create a mapping at a fixed virtual address. But I don't know, I haven't tried this...
Hmm, I'm pretty sure that doesn't answer your question. Maybe ask Jan-Arne Sobania? He's doing a lot of kernel work.
Hayder and Andreas are correct. The current SCC programming model is best described as "cluster-on-a-chip"; that is, you have a set of independent processor cores, each running its own operating system kernel. The kernels do not know of each other, and no part of the system state is shared or synchronized across the chip. It's just (up to) 48 Linuxes running in parallel. From the (or, better, each) kernel's point of view, it is running on a standard x86 single-core processor (P54C).
RCCE now provides access to the "non-standard" x86 features. However, that happens entirely in user mode, without any special kernel integration. The only time the kernel is invoked is for mmap'ing a certain device file, which in turn just results in mmap'ing specific physical addresses into the calling task's address space. This is essentially identical to just writing some page tables; the kernel does not know of any specific semantics for these entries.
The LUTs are one special SCC feature, as they allow something that is not possible on off-the-shelf x86 systems. They are essentially another layer of memory management, below the traditional x86 MMU. When looking at the usual x86 address translation (32-bit protected mode, paging enabled, no PAE), you get something like this (simplified):
- You start with a 48-bit "logical" address, which is divided into a 16-bit segment selector and 32-bit offset. 3 bits of the 16-bit segment selector specify the privilege level and descriptor table, the rest determine the descriptor index. The descriptor contains, among other things, the "base" "linear" address of the memory segment.
- After checking that the selector is valid and the offset is within the segment limit (an offset >= the descriptor limit results in a general protection fault exception), adding the 32-bit offset part of the "logical" address to the "base" address from the descriptor yields a 32-bit "linear" address.
- The 32-bit "linear" address is translated using the page-tables to a 32-bit "physical" address.
- The memory access is performed using the 32-bit "physical" address.
On the SCC, there is also step 5 (again, simplified; I'm leaving out the caches): the 32-bit "physical" address is converted, using the LUTs, into a 46-bit "system" address. Part of this "system" address also specifies which component is responsible for handling the memory access; i.e., it contains the target coordinates for the packet that will be sent over the on-die mesh network.
The Linux kernel does not know of the LUTs, or the last step of address translation. It just manages the memory it knows about, just as it would on any x86 system. To use the remapping capabilities of the LUTs in an application, you basically need to do this (please note that this is independent of whether you are using RCCE or doing everything from scratch):
- Reserve some range in your task's virtual memory space.
- Instruct the kernel to map those virtual addresses to certain physical addresses.
- Change the LUTs for those special physical addresses. Any following access to your virtual addresses will still result in them being translated to the same physical addresses, but the LUT will redirect the access to the new target.
Step 1 is required because all memory accesses happen in virtual memory. There is no means to selectively disable page-table translation for certain accesses, so you always need to set up some page tables; even if those entries are just dummies and you want to perform the real work by reconfiguring LUTs...
The idea here is that the Linux kernel only performs steps 1 and 2, and does not need to be involved afterwards. Once the mappings are set up, you can (e.g., via RCCE) reconfigure the LUTs yourself, "behind the back" of the kernel.
Finally, coming back to the original question: how to do linked data structures in shared memory. I can think of two variants:
- Allocate a "physical remapping" area (i.e., a set of LUT slots) per core. Map in this range, on all cores, at the same virtual address. Configure the LUTs to point to the same system memory. Now, you can exchange virtual addresses freely across cores (as long as they refer to objects in this special shared section).
- Allocate a "physical remapping" area (i.e., a set of LUT slots) per core. Map in this range, on all cores, at arbitrary virtual addresses. Configure the LUTs to point to the same system memory. You cannot exchange virtual address across cores. As an alternative, use offsets and let each core add the base (virtual) address of the mapping.
It may also be possible to hide the offset arithmetic of variant 2 behind some compiler syntax, if you are willing to write your own compiler (or modify an existing one). As an idea, the Microsoft C compiler supports the "__based" specifier for pointers; such pointers behave like normal ones from a programmer's point of view, but when stored to memory they are just offsets relative to a specific "base" address. Unfortunately, it seems GCC does not support this concept yet.
First of all, thanks for this nice post.
>>Allocate a "physical remapping" area (i.e., a set of LUT slots) per core. Map in this range, on all cores, at the same virtual address. Configure the LUTs to point to the same system memory. Now, you can exchange virtual addresses freely across cores (as long as they refer to objects in this special shared section).<<
I have thought about this idea before, but I got the impression that it's a poor and rather restrictive solution.
Could you give us more details about it, if possible with examples?