Bruno, Got this answer from one of our technologists .. if you need something deeper, I'll have to get him on line.
Typical "MESI" cache protocol applies here: when data is shared between two cores, the cache line will be in the Shared ("S") state. When one core wants to write the data, it first needs to invalidate the data in the other core's cache, then write the data into its own cache. This applies regardless of whether the two cores are a "pair" or not. It also applies to both the L1 and the L2 when the L2 is not shared by the cores in question.
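Chris's description can be sketched as a toy two-core state machine. All names here (`Core`, `read`, `write`) are hypothetical and purely for illustration; real MESI has more transitions than this:

```python
# Minimal two-core MESI sketch (illustrative only; states are "M"/"E"/"S"/"I").

class Core:
    def __init__(self, name):
        self.name = name
        self.state = "I"  # this core's copy of one cache line starts Invalid

def read(core, other):
    """A read brings the line in; if the other core also holds it, both go Shared."""
    if other.state in ("M", "E", "S"):
        other.state = "S"
        core.state = "S"
    else:
        core.state = "E"  # sole copy: Exclusive

def write(core, other):
    """A write first invalidates the other core's copy, then goes Modified."""
    other.state = "I"
    core.state = "M"

p0, p1 = Core("P0"), Core("P1")
read(p0, p1)   # P0: E, P1: I
read(p1, p0)   # both Shared
write(p0, p1)  # P0 must invalidate P1's copy before writing
print(p0.state, p1.state)  # M I
```

The key step for this discussion is the last one: the write cannot proceed while another core still holds a valid copy.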
Thanks, Chris, it makes a lot of sense.
But does this mean that the invalidated cache line will have to be updated from RAM in every situation?
If so, multi-core code still has to be optimized for data dependence between threads.
After all, a write to a shared datum would trigger a miss all the way down to RAM (possibly for more than one core), which is effectively quite costly.
Bruno, (here is the follow-up to your question, as I'm getting smarter on this topic)
Main memory does not need to be updated if a cache line is invalidated but has not been modified; it only needs to be updated if the cache line has been modified. Note: the cache coherency protocol for multi-core processors is no different than for a system with multiple single-core processors. No additional thread optimizations are required beyond the usual ones; you are simply likely to have many more threads running in a multi-core system.
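The rule above can be pictured in a few lines (hypothetical names, not any real API): a clean Shared line can be dropped silently, while a Modified line must be flushed to memory before it is invalidated.

```python
# Sketch: invalidation writes back to memory only if the line is dirty (Modified).
# `Line` and `invalidate` are made-up names for illustration.

class Line:
    def __init__(self):
        self.state = "S"     # Shared: clean copy, memory is already up to date
        self.writebacks = 0  # counts flushes to main memory

def invalidate(line):
    if line.state == "M":    # dirty data would be lost, so flush it first
        line.writebacks += 1
    line.state = "I"

clean, dirty = Line(), Line()
dirty.state = "M"
invalidate(clean)
invalidate(dirty)
print(clean.writebacks, dirty.writebacks)  # 0 1
```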
Thanks for the answer, Chris.
But still, I maintain my point that cache management still needs to be optimized for data dependence between cores inside a multi-core processor.
What you made me realize in your previous answer is that the only channel for communicating data between cores is through RAM (with the expense that implies).
If two cores of the same pair hold a copy of the same datum in their respective L1, this datum is marked as "shared", which is fine.
But, at the same time, their respective portions of the shared L2 each hold a copy of that same datum.
Then, when one of the two cores writes to that datum, since it is marked as "shared", its personal L1, instead of acting as a write-back cache, will act as a write-through and update the corresponding L2 copy of the datum.
At the same time, the second core's L1 copy (and probably its L2 copy) will be marked as "invalid", which is also fine.
But here's my point: if that second core needs to read that datum again, the original copy in RAM will need to be updated first so that the core can get it, through an L1 miss and an L2 miss.
This is because the L2 copy of the datum in the second core's portion of the L2 cannot be updated directly from the modified copy in the first core's portion, even though both cores share the L2 cache.
Is this assessment correct ?
If it is, it means that you cannot benefit from the proximity of the two cores when scheduling threads that have some data dependence between them.
Which is too bad...it could have been a good thing.
So, in fact, as you said in your last message, there is not much difference between the behaviour of cores in a multi-core processor and that of single cores in a multi-processor system, as far as data dependence is concerned.
Of course, I am talking only about the Xeon E5405 that I am using (actually I have two of them on my motherboard).
Sorry for the long reply and the abundance of details.
Bruno, You have taken me deeper into the architecture than I've ever been before. I journeyed to our technical marketing team and the architects who worked on these products. If this answer does not meet your needs, my suggestion is that you send me your email (via a private message in this forum) and I can put you in direct contact with someone who can have a more interactive conversation with you. Here is the response from the technical team:
The Penryn family supported a feature called cache-to-cache transfer that does exactly this. With this feature enabled, if P0's L1 has the line in Modified state and P1 is asking for it, then we will transfer the line from P0's L1 to the L2 and to P1's L1 (P0 and P1 share the same L2), *without* transferring the modified line to memory. However, this feature is enabled only in non-server parts; any part that is fused as “DP Enabled” has this feature disabled. This is based on performance studies on server configurations and server workloads (where cache-to-cache transfer seems to lose performance).
Since Harpertown is a server part (DP Enabled), this feature is disabled. So yes, in Harpertown, if P0's L1 has the line in Modified state and P1 is asking for it (P0 and P1 are on the same die sharing the L2), P0 will first write it to memory and then P1 will get it. The snoop resulting from the request from P1 (RFO or DRead) looking up the L2 will hit P0's L1, which will respond with HitM. At this point the L2 lookup is considered a miss (even if it was an L2 hit), and the request is transferred to the FSB via a Read For Ownership (RFO) or DRead, which sends another locking snoop to P0's L1. P0's L1 will send another HitM response. Then, based on FSB availability, a snoop confirm is sent to P0's L1, which results in P0's L1 putting the modified data on the FSB, destined for memory. This data is received from the FSB and forwarded to P1's L1 as well as the L2.
The description in the original question at the bottom of the thread is slightly incorrect. When P0 and P1 hold the line in Shared state and P0 wants to modify it, P0's L1 *does not* act as a write-through cache. The RFO resulting from P0's intention to modify the line results in all copies of the line (including P1's) being invalidated, and P0's L1 getting ownership. Then, the write by P0 will only be in P0's L1 and *will not* be sent to the L2.
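The corrected sequence can be sketched as a toy model (all names here, such as `states` and `sent_to_l2`, are hypothetical and for illustration only):

```python
# Toy model of the corrected behaviour: P0 writes to a line that both L1s
# hold Shared. The RFO invalidates P1's copy, P0's L1 goes Modified, and the
# new value stays in P0's L1 only; it is NOT written through to the shared L2.

states = {"P0_L1": "S", "P1_L1": "S"}
sent_to_l2 = False  # stays False: P0's L1 does not act as write-through

# RFO resulting from P0's intention to modify the line:
for holder in states:
    if holder != "P0_L1":
        states[holder] = "I"   # every other copy is invalidated
states["P0_L1"] = "M"          # P0 now owns the line and writes locally

print(states, sent_to_l2)  # {'P0_L1': 'M', 'P1_L1': 'I'} False
```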