I presented one of the late-afternoon sessions on the last day of this year's Fall IDF. I was pleasantly surprised that more than 50 people attended to hear about SSE4.2.
In the Q&A session, one attendee asked an important question about how software that uses unaligned SIMD instructions should deal with the memory overrun issue.
Unaligned SIMD instructions with 16-byte memory semantics are growing in number. They used to be limited to load and store instructions that worked around the 16-byte memory alignment restrictions hardware imposed on software. In SSE4.2, we have four new computational instructions that perform lexical processing and remove the 16-byte alignment restriction. In Intel Advanced Vector Extensions, all of the VEX-encoded computational and data processing instructions also remove alignment restrictions on 32-byte and 16-byte memory load semantics. The hardware trend is that Intel architecture is making the SIMD programming environment more flexible and convenient for software innovation.
Memory overrun is a relatively recent issue in using unaligned SIMD instructions. There are actually two aspects. First, a page fault might occur when the overrun crosses a page boundary. A page fault by itself does not cause a functional correctness issue; it merely manifests as extra delay in the overall execution cycles. Second, a caller may allocate a buffer that ends at a page boundary while the next page happens to be owned by the OS. In that case, memory overrun by an unaligned SIMD instruction reading from an address close to the page boundary introduces a situation that did not exist in the x86 era. It is not new to SSE4.2, however. Video processing techniques use unaligned loads frequently: memory outside the video frame boundary is loaded by unaligned load instructions, and the data outside the frame is discarded.
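To make the page-boundary case concrete, here is a minimal sketch of the arithmetic that decides whether a 16-byte or 32-byte load starting at a given address reaches into the next page. The 4 KiB page size and the names PAGE_SIZE_ASSUMED and load_crosses_page are assumptions made for illustration; they do not come from any Intel document or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Real code should query the OS for the page size. */
#define PAGE_SIZE_ASSUMED 4096u

/* Hypothetical helper: returns 1 if a load of `width` bytes starting at
   `addr` touches the page after the one containing `addr`, else 0. */
static int load_crosses_page(uintptr_t addr, size_t width)
{
    assert(width > 0 && width <= PAGE_SIZE_ASSUMED);
    /* Offset of the first byte within its page. */
    uintptr_t offset = addr & (PAGE_SIZE_ASSUMED - 1u);
    /* The load covers [offset, offset + width); it spills over when that
       range extends past the end of the page. */
    return offset + width > PAGE_SIZE_ASSUMED;
}
```

With these assumptions, a 16-byte load that starts 16 bytes before the end of a page stays inside it, while one that starts 15 bytes before the end reads one byte from the following page.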
How should software deal with a memory buffer that has two distinct range requirements? In the traditional x86 sense, the range of valid buffer data is the same as the permitted address range for memory access. As memory access granularity grows to 16 bytes and 32 bytes and unaligned memory access becomes even more prevalent, is it reasonable to insist that software continue to operate within the confines of the x86 programming model?
The spirit of software innovation is that we evolve and adapt. One simple solution is for the caller's buffer allocation size to adapt to the newer unaligned SIMD memory semantics by adding a small pad. Essentially, this is the familiar notion of a scrap area or work buffer, attached at the end of what a traditional, simplistic x86 buffer allocation scheme would allocate. Insisting that software using unaligned SIMD memory semantics add overhead to check for page boundaries, and add prolog/epilog code to avoid memory overrun, is a bit like throwing the baby out with the bath water.
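The padding idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the 32-byte pad (sized for the widest load discussed here), the helper names alloc_simd_padded and load16_anywhere, and the use of memcpy as a stand-in for an unaligned 16-byte SIMD load are all hypothetical choices, not an API from Intel or any library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: the widest unaligned load the code will issue is 32 bytes. */
#define SIMD_PAD_BYTES 32u

/* Hypothetical helper: allocate `valid_bytes` of usable data followed by a
   zeroed scrap area, so a wide load starting at any valid byte stays inside
   memory the caller owns. Returns NULL on failure. */
static unsigned char *alloc_simd_padded(size_t valid_bytes)
{
    if (valid_bytes > SIZE_MAX - SIMD_PAD_BYTES)
        return NULL;                       /* size computation would overflow */
    unsigned char *p = malloc(valid_bytes + SIMD_PAD_BYTES);
    if (p != NULL)
        memset(p + valid_bytes, 0, SIMD_PAD_BYTES);  /* pad has known contents */
    return p;
}

/* Stand-in for an unaligned 16-byte load: reads 16 bytes starting at any
   offset below valid_bytes. The bytes past the valid data come from the pad
   and are meant to be discarded by the caller. */
static void load16_anywhere(const unsigned char *buf, size_t offset,
                            unsigned char out[16])
{
    memcpy(out, buf + offset, 16);
}
```

The range of valid data and the range of permitted addresses are now two separate quantities: the caller tracks valid_bytes for correctness, while the allocation itself is large enough that a read near the end never leaves the buffer.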
Some more discussion can be found in the Intel AVX spec, Chapter 2.
If you are interested in SSE4.2, you might want to take a look at the IDF presentation.