Matching processors to Big Data needs

Friday, January 8, 2016

Professor John Goodacre, Director of Technology and Systems, ARM

At ARM, we create intellectual property for processor design. The company, which recently celebrated its 25th birthday, has achieved enormous success: around 75 billion processor cores have been shipped into a myriad of different devices, suggesting great versatility across application domains. But could low-power processors such as ARM’s really handle Big Data workloads? In this write-up, I’d like to provide a high-level overview of how Big Data applications place requirements on the design of a processor core.

There are two main issues to take into consideration: first, the bandwidth available to transfer data to/from the processor; and second, the maximum number of operations per clock cycle that the processor can perform.

As far as moving data to and from the processor is concerned, the main constraints are the external memory devices and input/output (I/O) interfaces rather than the processor core IP. ARM sells processor core IP rather than complete integrated circuits, and it is up to the silicon partner to build a complete system-on-chip, which integrates the ARM IP into a single package together with appropriate I/O and memory interfaces. Inside the chip, the ARM IP interfaces with the rest of the silicon functionality over a high-speed internal bus such as AMBA. The ARM Cortex-A72, for example, even at a modest clock frequency of 1 GHz, can support an internal bus capable of reading and writing a total of 32 GB/s, which is sufficient to saturate four channels of double data-rate (DDR) memory. Only new serial memory devices, such as Micron’s HMC (Hybrid Memory Cube), are capable of breaking through this memory wall.
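As a back-of-the-envelope illustration of where a figure like 32 GB/s comes from, the sketch below assumes independent 128-bit read and write channels clocked at 1 GHz; the channel widths are my assumption for illustration, not a statement of the Cortex-A72’s exact interface configuration.

```c
#include <stdio.h>

/* Back-of-the-envelope bus bandwidth. The 128-bit read and write
 * channel widths are illustrative assumptions only. */
int main(void) {
    const double clock_hz      = 1e9;         /* 1 GHz bus clock         */
    const double channel_bytes = 128.0 / 8.0; /* 128-bit channel = 16 B  */

    /* Independent read and write channels each move one beat per cycle. */
    double read_gbs  = clock_hz * channel_bytes / 1e9;
    double write_gbs = clock_hz * channel_bytes / 1e9;

    printf("Read:  %.0f GB/s\n", read_gbs);             /* 16 GB/s */
    printf("Write: %.0f GB/s\n", write_gbs);            /* 16 GB/s */
    printf("Total: %.0f GB/s\n", read_gbs + write_gbs); /* 32 GB/s */
    return 0;
}
```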

Designed appropriately, the integrated system-on-chip could also handle hundreds of GB/s of I/O. So, as far as moving large amounts of data through a microprocessor device is concerned, a chip’s suitability for Big Data applications depends more on the design of its memory and I/O components than on the processor core itself, meaning that a modern power-efficient ARM core can certainly be competitive.

The second key challenge in processor design relates to the number of operations that the arithmetic unit can carry out within a single clock cycle. This depends predominantly on the processor’s issue width, which sets the maximum number of instructions per cycle (IPC) the processor can execute. It also depends, however, on the characteristics of the software code, in particular the dependencies between instructions, i.e. whether multiple instructions are free of dependencies and can therefore be executed in parallel in the same cycle.
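To see how dependencies limit achieved IPC, consider the two deliberately simplified C loops below. The first forms a serial dependency chain; the second exposes independent operations that a superscalar core can issue in the same cycle. This is an illustrative sketch, not ARM-specific code.

```c
/* Serial: each addition needs the previous result, so at most one of
 * these additions can complete per cycle regardless of issue width. */
long sum_dependent(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Parallel-friendly: the four partial sums are independent, so a
 * 4-wide core can, in principle, keep four additions in flight. */
long sum_independent(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* handle any leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```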

The Big Data processing landscape is today dominated by multicore superscalar processors. Such processors are preferred because the alternative of continually extending the issue width of a single-core processor quickly reaches diminishing returns: an exponentially increasing cost in silicon and energy consumption generates only a small increase in performance.

The ratio of operations per byte of I/O is a property of each specific application. If a Big Data application performs only a small number of operations per unit of data, then total capital and operational costs can be greatly reduced by using a larger number of smaller processors, such as those in Cavium’s ARM-based ThunderX. This device is perfectly capable of meeting the needs of applications that focus more on data movement than on compute. If an application needs more operations per unit of data, then a wider-issue core such as the Cortex-A57, as used in AMD’s “Seattle” devices, could be more appropriate.
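The sketch below makes the operations-per-byte idea concrete with two hypothetical kernels; the counts are invented for illustration and are not measured characteristics of either device.

```c
#include <stdio.h>

/* A minimal sketch of the operations-per-byte ratio discussed above.
 * Both kernels and their thresholds are hypothetical examples. */
int main(void) {
    /* A filter pass: ~1 compare per 8-byte record read. */
    double filter_intensity = 1.0 / 8.0;    /* ~0.125 ops/byte */

    /* A scoring pass: ~50 arithmetic ops per 8-byte record. */
    double scoring_intensity = 50.0 / 8.0;  /* ~6.25 ops/byte  */

    printf("filter:  %.3f ops/byte -> bandwidth-bound, favours many small cores\n",
           filter_intensity);
    printf("scoring: %.3f ops/byte -> compute-bound, favours wide-issue cores\n",
           scoring_intensity);
    return 0;
}
```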

As Big Data processing needs grow, the general challenge is to match the processor’s designed capability to application requirements, which can only be done precisely through an understanding of application needs and co-design between hardware and software. However, since only a few vertical businesses have the necessary resources and commercial volume to match an ideal processor design to each application, most designs will have to take a more general-purpose approach. This could be achieved by over-provisioning a single design, or by offering multiple processors in the same chip, each with different capabilities. The latter approach, known as heterogeneous multicore processing, means that only the processors best matched to the type and number of operations the application requires need to be powered up. To maximize cost efficiency, such heterogeneity should be tightly integrated into the design. Heterogeneous processing poses specific challenges for software developers – but that’s a story for another day.
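By way of illustration, here is a minimal sketch of how software might steer work onto the right core type on a heterogeneous Linux system using standard thread affinity; the CPU numbering (little cores 0-3, big cores 4-7) is an assumption for the example, as real systems expose their topology via sysfs.

```c
/* Minimal sketch: pinning threads to core types on a big.LITTLE-style
 * Linux system. The CPU numbering is an illustrative assumption. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpus(int first, int last) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    /* Restrict the calling thread to the given range of cores. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *data_movement_work(void *arg) {
    (void)arg;
    pin_to_cpus(0, 3);   /* assumed little cores: streaming/filtering */
    /* ... bandwidth-bound work ... */
    return NULL;
}

static void *compute_work(void *arg) {
    (void)arg;
    pin_to_cpus(4, 7);   /* assumed big cores: operation-heavy work */
    /* ... compute-bound work ... */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, data_movement_work, NULL);
    pthread_create(&t2, NULL, compute_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```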