I can comment on your first question. Yes, the registers are memory-mapped and can be accessed by all cores. The functions RCCE_acquire_lock() and RCCE_release_lock() defined in RCCE_admin.c show how locking is implemented on top of these registers.
I'm not sure what you mean by
what is the atomic test&set machine operation? How such a command benefits from semantics of test&set register?
As Andreas said, we access the test&set registers through memory-mapped I/O, using mmap() through the device /dev/rckncm. What more do you want to know beyond that? Are you running on baremetal or on SCC Linux? Are the test&set registers giving you the functionality and performance that you need?
Thanks for your useful answers. I want to know: apart from the shared atomic test-and-set registers, are there any other atomic objects available?
And regarding atomic operations in the instruction set: is there an atomic swap (fetch-and-store) operation available? What about compare-and-swap (compare-and-exchange)? These could be useful for implementing concurrent data structures where test-and-set registers are not enough. I haven't started implementing on the platform yet.
You might find these two threads helpful:
Compare and Exchange does not work as you would expect on SCC.
Those other discussions are a good suggestion. Overall, think cluster programming, not shared-memory programming. Atomic instructions work with respect to single-core execution (e.g. an interrupt won't divide a CMPXCHG), but using LOCK to hold off another core is not intended to work. Use a message to synchronize, not a memory location.
Thanks for your useful comments. I understand your point about taking a message-passing view of the platform rather than a shared-memory one. But I am considering the use of off-chip shared memory in case applications want it. With this in mind, I have the following questions:
1- So you mean it is not possible to provide synchronization (e.g. locks) for access to off-chip shared memory using the 48 TNS registers? They already synchronize access to shared on-chip memory, so why not do the same for shared off-chip memory? In other words, why should TNS locks be used only to implement the message-passing layer?
2- If an application needs to use shared data structures in off-chip shared memory and we avoid using the TNS registers to implement locks, how should mutual exclusion and synchronization for them be handled? A distributed lock built on top of message passing, maybe?
The TNS bits can be used for whatever you like: protecting access to on-die memory, off-die memory, anything an atomic test-and-set is useful for. They are just a globally visible set of bits with test-and-set semantics. I mention messaging because there are only 48 of them, so a layered solution seems appropriate. SCC messaging support is agnostic to on-die vs. off-die.
Thanks for your clear answer. Another question:
Does the current implementation of locks in the messaging library require cores to spin on the local on-chip memory of other cores? If so, wouldn't that be a scalability bottleneck for these locks, since they consume interconnect traffic and could create high contention on the memory-mapped TNS registers?
I can't speak to the specific implementation on RCCE (if that is what you mean).
Polling is limited by the throughput of the cores which is much less than the mesh can sustain.
Polling requires traffic but is very low latency. I'd expect that if a system is doing work and is load-balanced, the period of polling will be brief.
Async is also possible, but interrupts burn power and have high latency. The best design point in the tradeoff will depend on the nature of your workload.