Although Chipzilla is not mentioning the speed of the chip, it is starting to talk about some of the design decisions that it believes will make the Xeon Phi coprocessor a top HPC accelerator.
Spilling some of the beans at Hot Chips was George Chrysos, who was the chief architect of Knights Corner.
He said that the Knights Corner microarchitecture aimed to pack a lot of number-crunching into a power-efficient package. It did this by sticking a vector processor onto a bare-bones x86 core.
Chrysos said that just two percent of the Knights Corner die is dedicated to decoding x86 instructions.
The rest of the chip is devoted to the L1 and L2 caches, the memory I/O, and the 512-bit vector unit.
This is the widest vector unit Intel has developed, and each one can handle 8 double precision or 16 single precision SIMD operations per clock cycle. A Xeon Sandy Bridge or AMD Bulldozer core can only manage half of that, and since there will be 50-plus cores, Knights Corner can manage 400 double precision flops per cycle. On a 2 GHz processor that works out at 800 gigaflops, and since Intel is building the chip on its 22nm process, clock speeds, and therefore the final number, could end up higher.
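The back-of-the-envelope sums above can be checked in a few lines. The 2 GHz clock and flat 50-core count are the article's illustrative figures, not confirmed Intel specs:

```python
# Peak-flops arithmetic for Knights Corner, using the article's figures.
VECTOR_WIDTH_BITS = 512
DP_BITS = 64          # one double precision float
SP_BITS = 32          # one single precision float

dp_ops_per_cycle = VECTOR_WIDTH_BITS // DP_BITS   # 8 per core
sp_ops_per_cycle = VECTOR_WIDTH_BITS // SP_BITS   # 16 per core

cores = 50            # "50-plus cores" -- assumed flat 50 here
clock_hz = 2e9        # assumed 2 GHz clock

chip_dp_flops_per_cycle = cores * dp_ops_per_cycle          # 400
peak_dp_gflops = chip_dp_flops_per_cycle * clock_hz / 1e9   # 800.0

print(dp_ops_per_cycle, sp_ops_per_cycle, chip_dp_flops_per_cycle, peak_dp_gflops)
```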
Chrysos added that the design includes other HPC features. There is a math accelerator called the Extended Math Unit, which does polynomial approximations of transcendental functions such as square roots, reciprocals, and exponents to speed up their execution in hardware. It still lacks a device to remove stones from horses' hooves or a divide-by-your-shoe-size capability.
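The general trick such units rely on can be sketched in software: start from a crude estimate (a small lookup table in real hardware) and refine it with a short polynomial step. This is a generic Newton-Raphson sketch for the reciprocal, not Intel's actual Extended Math Unit logic:

```python
# Illustrative only: refine a rough reciprocal estimate with the
# Newton-Raphson polynomial step y = y * (2 - x*y), which roughly
# doubles the number of accurate bits per iteration.
def approx_reciprocal(x, seed_bits=8, iterations=2):
    # Crude seed: 1/x quantized to seed_bits of precision, standing
    # in for a small hardware lookup table.
    scale = 2 ** seed_bits
    y = round((1.0 / x) * scale) / scale
    for _ in range(iterations):
        y = y * (2.0 - x * y)
    return y

print(approx_reciprocal(3.0))   # close to 0.33333...
```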
Also under the bonnet is a scatter-gather capability, also known as vector addressing or vector I/O, which improves storing and fetching of data at non-contiguous memory addresses.
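What gather and scatter do is easy to show in plain Python, with a list standing in for flat memory; in hardware one vector instruction does each of these loops:

```python
# Gather/scatter sketch: moving data between a contiguous vector
# register and non-contiguous memory addresses.
memory = list(range(100, 200))            # stand-in for flat memory
indices = [3, 17, 42, 5, 99, 0, 64, 28]   # scattered addresses

# Gather: pull scattered elements into one contiguous vector register.
vector_reg = [memory[i] for i in indices]

# Scatter: write the vector register back out to scattered addresses.
for i, value in zip(indices, vector_reg):
    memory[i] = value

print(vector_reg)
```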
Chrysos told HPCwire that Knights Corner will handle cache coherency in hardware, but that the scheme has been extended to cope with so many cores.
On Knights Corner, the L2 cache is 512 KB per core, twice the size of the one on the Sandy Bridge Xeons. There is a translation lookaside buffer to speed up address translation, tag directories that track all of the cores' L2 caches, and a data cache (Dcache) that can simultaneously load and store 512 bits per clock cycle.
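That 512-bit figure is exactly one 64-byte cache line in each direction per cycle, which is what makes it worth a headline number. A quick sum, reusing the article's illustrative 2 GHz clock:

```python
# Dcache throughput sketch: 512 bits loaded and 512 bits stored each
# cycle is one full 64-byte cache line in and one out per clock.
bits_per_cycle = 512
bytes_per_cycle = bits_per_cycle // 8        # 64, one cache line

clock_hz = 2e9                               # assumed 2 GHz, as above
# Simultaneous load + store gives twice the one-way figure:
per_core_bw_gbs = 2 * bytes_per_cycle * clock_hz / 1e9

print(bytes_per_cycle, per_core_bw_gbs)      # per-core GB/s
```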
Chrysos claims that these cache features have increased per-core performance by an average of 80 percent.