Therefore the maximum is 32*68*1.4 = 3046.4 DP GFLOPS. For a single core the peak is 32*1.6 = 51.2 DP GFLOPS. Dense matrix multiplication is one of the few operations that can actually get close to peak FLOPS. The Intel MKL library provides optimized dense matrix multiplication functions. On Sandy Bridge systems ...

Take, for example, the Intel Xeon E5-2680 "Sandy Bridge" processors in Stampede, where I work. The specs are: 2.7 GHz; 2 chips/node, 8 cores/chip; 2 vector instructions/cycle; 256-bit wide AVX instructions (4 simultaneous double-precision operands). Multiplying those gives 345.6 GF/node, or 2.2 PF for the un-accelerated part of the system. We usually think in terms of double-precision (64-bit ...

I'm confused about how many flops per cycle per core can be done on Sandy Bridge and Haswell. As I understand it, it should be 4 flops per cycle per core with SSE and 8 flops per cycle per core with AVX/AVX2. This seems to be verified here, "How do I achieve the theoretical maximum of 4 FLOPs per cycle?", and here, "Sandy-Bridge CPU specification".
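The peak figures quoted above all come from the same product: FLOPs per cycle per core, times clock rate, times core count. As a minimal sketch of that arithmetic (the function name `peak_gflops` and its parameter names are made up for illustration and do not come from any library), the numbers can be reproduced like this:

```python
def peak_gflops(flops_per_cycle, ghz, cores=1, chips=1):
    """Theoretical peak in GFLOPS.

    flops_per_cycle: double-precision FLOPs per cycle per core
    ghz:             clock rate in GHz
    cores:           cores per chip
    chips:           chips per node
    """
    return flops_per_cycle * ghz * cores * chips


# 68-core part from the first excerpt: 32 DP FLOPs/cycle/core at 1.4 GHz.
many_core_all = peak_gflops(32, 1.4, cores=68)

# Same part, one core, at the 1.6 GHz single-core clock quoted in the text.
many_core_one = peak_gflops(32, 1.6)

# Xeon E5-2680 node: 2 vector instructions/cycle x 4 DP operands = 8
# FLOPs/cycle/core, at 2.7 GHz, with 8 cores/chip and 2 chips/node.
sandy_bridge_node = peak_gflops(2 * 4, 2.7, cores=8, chips=2)

print(f"{many_core_all:.1f} DP GFLOPS (68 cores)")
print(f"{many_core_one:.1f} DP GFLOPS (1 core)")
print(f"{sandy_bridge_node:.1f} DP GFLOPS (Sandy Bridge node)")
```

Note that the Sandy Bridge factor of 2 x 4 = 8 is the same "8 flops per cycle per core for AVX" that the question refers to. These are theoretical ceilings only; measured throughput, even for dense matrix multiplication, will fall somewhat below them.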

