Multi-Core FFT Performance on Intel® Sandy Bridge Processors
Embedded Computing Design, June 19, 2014
This paper examines how computational performance scales on the Intel® Sandy Bridge
multicore architecture, particularly when used on Mercury processor boards.
We look specifically at two device types, an 8-core Xeon (E5-2648L clocked at 1.8 GHz) and a 4-core
Core i7 (2715QE clocked at 2.1 GHz). These are relatively low-power devices which are suitable for
use on embedded processor boards. The algorithm of interest is a complex FFT (fft_copx) from the
Mercury MathPack library, which represents the numerically intensive processing requirements typical
of many embedded signal processing systems. One question to be addressed is whether performance
falls off as the FFT algorithm is run on an increasing number of cores of the same device simultaneously.
Even though the cores are independent and each has its own L1 and L2 caches, they share access to an
L3 cache and to the DRAM memory controller. As the FFT size increases from 1K (1024 points) to
1M (1,048,576 points), demands on the memory subsystem grow and the cores compete for these shared resources.
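To make the memory pressure concrete, the following minimal sketch estimates the size of one FFT data buffer and the smallest cache level that could hold it. It is an illustration under stated assumptions, not a description of how MathPack allocates memory: it assumes single-precision complex samples (8 bytes per point), counts only one buffer per core (an out-of-place FFT would need at least two, plus twiddle factors), and uses cache capacities taken from vendor specifications rather than from this article. The function names are hypothetical.

```python
# Assumption: single-precision complex samples (two 4-byte floats per point).
BYTES_PER_POINT = 8

# Assumed per-core cache capacities for Sandy Bridge (not stated in the article).
L1D_BYTES = 32 * 1024
L2_BYTES = 256 * 1024

# Assumed shared L3 capacity of the 8-core Xeon (vendor specification).
XEON_L3_BYTES = 20 * 1024 * 1024


def fft_buffer_bytes(n_points, bytes_per_point=BYTES_PER_POINT):
    """Bytes needed to hold one complex buffer of n_points samples."""
    return n_points * bytes_per_point


def smallest_fitting_level(n_bytes, l3_bytes):
    """Name the smallest cache level that can hold n_bytes, else 'DRAM'."""
    levels = (("L1", L1D_BYTES), ("L2", L2_BYTES), ("L3", l3_bytes))
    for name, capacity in levels:
        if n_bytes <= capacity:
            return name
    return "DRAM"


def aggregate_bytes(n_points, active_cores):
    """Total buffer bytes when every active core runs its own FFT."""
    return active_cores * fft_buffer_bytes(n_points)


if __name__ == "__main__":
    # Walk the FFT sizes from 1K to 1M in powers of four.
    for log2_n in range(10, 21, 2):
        n = 1 << log2_n
        one = fft_buffer_bytes(n)
        eight = aggregate_bytes(n, 8)
        print(n, one, smallest_fitting_level(one, XEON_L3_BYTES),
              eight, "fits L3" if eight <= XEON_L3_BYTES else "exceeds L3")
```

Under these assumptions, a single 1M-point buffer (8 MiB) still fits in a 20 MiB L3 cache, but eight such buffers (64 MiB), one per core, do not.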
This means that for some applications, you cannot simply measure the timing on a single core and then
scale the results to size your system; the falloff in performance as all cores are utilized must be taken into account.
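
The sizing argument can be sketched as a small calculation. The example below compares the naive estimate (single-core timing multiplied by the core count) with an estimate based on the per-FFT time measured while all cores are busy. All function and parameter names are hypothetical, and the timings passed in the demonstration are placeholders chosen for easy arithmetic, not measurements from this paper.

```python
import math


def scaling_efficiency(single_core_time, loaded_core_time):
    """Fraction of single-core throughput each core keeps under full load."""
    return single_core_time / loaded_core_time


def naive_throughput(single_core_time, cores):
    """FFTs per second if every core matched the single-core timing."""
    return cores / single_core_time


def derated_throughput(loaded_core_time, cores):
    """FFTs per second using the per-FFT time measured with all cores busy."""
    return cores / loaded_core_time


def cores_needed(required_ffts_per_sec, time_per_fft):
    """Smallest whole number of cores that meets the required FFT rate."""
    return math.ceil(required_ffts_per_sec * time_per_fft)


if __name__ == "__main__":
    # Placeholder timings in seconds (hypothetical, NOT measured values).
    t_single = 0.5   # one FFT with only one core active
    t_loaded = 0.75  # one FFT on each core with all cores active
    cores = 8

    print("efficiency:", scaling_efficiency(t_single, t_loaded))
    print("naive FFT/s:", naive_throughput(t_single, cores))
    print("derated FFT/s:", derated_throughput(t_loaded, cores))
    print("cores for 10 FFT/s, naive:", cores_needed(10, t_single))
    print("cores for 10 FFT/s, derated:", cores_needed(10, t_loaded))
```

The gap between the naive and derated core counts is the sizing error that results from scaling a single-core measurement.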