I’m George Chrysos, and I led the architecture development of the first Intel® Xeon Phi™ coprocessor, codename Knights Corner, based on the Intel Many Integrated Core (Intel MIC) architecture. When Intel kicked off the Knights Corner program, first publicly announced at ISC in May 2010, the project was an exciting opportunity to optimize processor architecture for a specific class of workloads. The Intel MIC architecture specifically targets highly parallel technical applications in physics, chemistry, biology, and financial services. Examples include weather prediction and climate modeling, fluid dynamics, quantum chromodynamics, protein folding, genetics, and options modeling. This allowed us to focus our design and adopt a “master of some” rather than “jack-of-all-trades” approach. I refer to this set of workloads as highly parallel high-performance computing, or highly parallel HPC.
My aspiration for the Intel MIC architecture is to provide an efficient and robust computing platform for scientists and engineers who are improving the lives of people everywhere by advancing medical research to find cures for cancer and other diseases, increasing the early warning notice time for weather disasters, making energy exploration more efficient, facilitating the design and development of new technologies, and improving the efficiency of global trade.
So, what does an architecture that optimizes for highly parallel HPC look like? We started with three premises:
Maximizing reuse of the existing software ecosystem and practices (i.e., the ability to reuse existing source code and support Intel’s standard software development environment) would be important to making the architecture more useful and more widely applicable to our customers;
The applications are very highly parallel, meaning there are plenty of tasks and threads of work;
Power efficiency or performance-per-watt on the target workloads is the key metric of goodness.
Reuse of the existing software ecosystem is a key attribute of the Intel MIC architecture. Knights Corner does not require workloads to be rewritten in a new programming language, nor does it require a programmer to cope with a software-managed memory coherency and consistency model. From a programmer’s standpoint, the architecture is simply a multi-core processor. Knights Corner runs a standard full-service OS; it is a networked node that you can telnet into or communicate with via MPI or sockets programming. Fortran, C, and C++ code can be compiled and run correctly on the coprocessor. The OpenMP threading library is supported, and multiple MPI tasks can run simultaneously. VTune can be used for performance characterization and tuning, and the standard Intel debuggers, compilers, and libraries are available. In short, it’s just a computer. But one might ask: “What does this support cost?” The answer might surprise you. I estimated the cost of supporting standard Intel CPU specifics in the core at less than 2% of the area; accounting for the full chip, it is even less. The value to programmers for this minimal investment is huge.
To optimize the architecture for performance/watt of highly parallel HPC workloads we made three architectural investments:
We built cores that are smaller than Intel® Xeon® processor cores. The Knights Corner cores do less speculative work than those cores. Doing large amounts of speculative instruction processing is a great way to speed up a single thread of execution, but sometimes the speculative work is aborted and does not contribute to program progress. When there are no other threads (or processes) around, this is the right tradeoff; but when threads are abundant, instead of doing speculative work for a single thread, we simply choose another thread that has instructions ready to go. Speculative work that is aborted consumes energy that, in a highly parallel workload, we would rather spend on progress being made by another thread. The Knights Corner cores support four thread contexts and execute instructions in program order. Many micro-architectural choices were also made to optimize the cores specifically for HPC workloads. Among the design decisions we made were: 1) building the L1 data cache to do both a 512-bit load and a 512-bit store per cycle, 2) adding a large L2 TLB, 3) providing an ample 512KB L2 cache per core, and 4) adding a hardware prefetcher. In total, these design choices improved SPEC CPU2006 floating-point performance by more than 80% per core, per cycle.
We introduced the widest SIMD instruction set offered by Intel to date, at 512 bits. In highly parallel programs, there are frequently inner loops that step regularly through memory and perform the same math operations repeatedly. When these loops are assembled with traditional scalar instructions, the core incurs energy overhead for processing each instruction (fetching, decoding, reading the register file and the data cache, tracking dependencies, etc.). SIMD amortizes that cost by doing all of the bookkeeping once and performing many math operations in just one instruction. In Knights Corner, a single vector floating-point multiply-add (vfma) instruction performs 32 single-precision or 16 double-precision operations. The wide SIMD instructions allow us to offer very high FLOP rates for computationally dense workloads under a constrained power budget. Wide SIMD can be challenging for an auto-vectorizing compiler, or even an assembly programmer, to use efficiently: the wider the SIMD, the more challenging it is to keep all of the parallel ALUs busy. To mitigate this, the Knights Corner SIMD instruction set adds several features: 1) mask registers, 2) gather/scatter instructions, and 3) extended math unit operations. Mask registers allow predicated execution per ALU, which supports vectorization across short conditional branches as well as efficient software pipelining. Gather/scatter instructions are vector loads and stores for memory access patterns that are not stride-1 but irregular; they avoid serializing the code in such regions. The extended math unit operations provide high-performance vectorized transcendental operations: square root, reciprocal, logarithm, and power. In total, the wide SIMD instruction set is a great match for highly parallel programs.
To support more than 50 cores, we built a scalable, high-bandwidth interconnect and memory subsystem. The on-die interconnect is a high-bandwidth bidirectional ring that connects the cores to one another and to many GDDR5 memory controllers. We placed the memory controllers symmetrically around the ring to avoid hot spots and provide a smooth bandwidth response. We also introduced a vector streaming store instruction that reduces memory bandwidth consumption when writing output-only arrays to memory.
On top of all of that, we put Intel’s world-class power management technology into Knights Corner, so that when individual cores are idle, or Knights Corner is not processing anything, it reduces its power consumption proportionately. Knights Corner is a coprocessor to the Intel® Xeon® processor; either processor, or both together, can be used to get the best performance for a particular application. When not processing, Knights Corner consumes as little power as possible.
In summary, we optimized the Intel® Xeon Phi™ coprocessor (codename Knights Corner) to deliver leading performance/watt for highly parallel technical computing workloads, while maintaining the Intel values of programmability and effective power management technology. I am looking forward to seeing this product family contribute to scientific and technical progress!
For more, take a look at my presentation from Hot Chips on the Intel MIC architecture and the Intel® Xeon Phi™ coprocessor.