Custom Math Functions for High Performance Computing

Implementations of various transcendental math functions, including "erfc", with no conditional branches.

Date Posted: May 10, 2006

What is Custom Math Functions for High Performance Computing?

Supercomputer use is typically limited by the electricity bill, and math functions have a direct role in this cost. Math functions such as "log", "exp", "arcsin", and "erfc" are usually implemented as library functions with conditional branches. The conditional branches inhibit the compiler's ability to perform various optimization transformations; this inhibition, in turn, limits the performance of the application to below the theoretical peak of the hardware, thus increasing the computational costs. Custom Math Functions for High Performance Computing provides implementations of these functions without conditional branches, so that the compiler can schedule the arithmetic to keep the floating-point pipeline full.

For example, the IBM BlueMatter protein-folding application, which is being developed to exploit the IBM eServer BlueGene supercomputer, requires a number of math functions to be available at high performance. Compared with alternatives such as the "MASS" library, these functions have relaxed requirements for domain and accuracy. A number of the math functions implemented in that application were described in the IBM Journal of Research and Development. Custom Math Functions for High Performance Computing makes source code for these functions available to others who may wish to use and build on them. The source code is in C++, and IBM is using it on the PowerPC "440d" architecture; it may be useful as a starting point for other architectures and languages.

How does it work?

This technology provides "branchless" implementations of selected math functions, enabling the application to exploit the hardware to better effect, as reported in the "Blue Gene" issue of the IBM Journal of Research and Development (Volume 49, Number 2/3, 2005).

Modern microprocessors typically operate in a pipelined fashion and may have replicated arithmetic units. The IBM "440d" PowerPC processor core, which is used in the node chips for BlueGene (and is available for designers to use in other embedded processors), has a five-cycle floating-point pipeline and two floating-point "multiply-add" units. Sequentially dependent arithmetic, such as

a=b*x+c*y+d*z

will run at a rate of one addition and one multiplication per five cycles, because each multiply-add must wait for the result of the one before it; but if there is independent arithmetic to do, then

a0=b0*x0+c0*y0+d0*z0,
a1=b1*x1+c1*y1+d1*z1,
... ,
a9=b9*x9+c9*y9+d9*z9

will run at a rate of ten additions and ten multiplications per five cycles. In the best case, this can reduce the time that the application needs by a factor of ten.
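
The ten independent evaluations above can be sketched in C++ as a simple loop. This is only an illustration and is not taken from the distributed source: the function name weighted_sums, the constant kStreams, and the array layout are assumptions. Each iteration reads and writes only its own elements, so an optimizing compiler is free to unroll the loop and interleave the ten multiply-add chains; whether it actually does so depends on the compiler and its options.

```cpp
#include <cstddef>

// Hypothetical constant: the ten independent streams from the text.
constexpr std::size_t kStreams = 10;

// Hypothetical helper: computes a[i] = b[i]*x[i] + c[i]*y[i] + d[i]*z[i]
// for each of the kStreams elements. No iteration uses a result produced
// by another iteration, so the chains can overlap in the pipeline.
// Assumes the output array `a` does not overlap any input array.
void weighted_sums(const double* b, const double* x,
                   const double* c, const double* y,
                   const double* d, const double* z,
                   double* a)
{
    for (std::size_t i = 0; i < kStreams; ++i) {
        a[i] = b[i] * x[i] + c[i] * y[i] + d[i] * z[i];
    }
}
```

Within one element the three products are still summed one after another; the benefit comes from having nine other elements to work on while each multiply-add is in flight.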

Achieving this "peak performance" is straightforward for linear algebra, but the synthesis of other mathematical functions, such as "log" and "sin", tends to require conditional branches; each of the ten streams would need to branch in its own direction, and this branching would make it impossible to keep the pipeline full.
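
One common way to avoid such a branch is to evaluate every piece of a piecewise approximation and then select the result arithmetically. The sketch below illustrates that idea only; it is not the method used in the distributed functions. The names piece_low, piece_high, piecewise_branching, and piecewise_branchless, the two polynomials, and the split point at 1.0 are all hypothetical.

```cpp
// Two hypothetical polynomial pieces that meet at x = 1.0.
inline double piece_low(double x)  { return 1.0 + x * (1.0 + 0.5 * x); }
inline double piece_high(double x) { return 2.5 + 2.0 * (x - 1.0); }

// Branching form: the path taken depends on the value of x, so ten
// streams with different inputs may each go a different way.
inline double piecewise_branching(double x)
{
    if (x < 1.0) {
        return piece_low(x);
    }
    return piece_high(x);
}

// Branch-free form: both pieces are always evaluated, and a 0.0/1.0 mask
// derived from the comparison picks the one that applies.
// Assumes both pieces are finite for every x in the supported domain;
// otherwise 0.0 * infinity would produce NaN.
inline double piecewise_branchless(double x)
{
    const double mask = static_cast<double>(x >= 1.0);
    return mask * piece_high(x) + (1.0 - mask) * piece_low(x);
}
```

The branch-free form does more arithmetic per call, but every call executes the same instruction sequence, which is what lets independent evaluations be interleaved. Whether the comparison itself compiles without a branch depends on the compiler and the target.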

A number of such mathematical functions are in use in an IBM Research project that aims to understand the atomic and molecular behavior behind the process that happens when light is absorbed by the proteins on the back of the retina; the intent is to spend one year of processing on 20,480 dual-processor chips' worth of BlueGene, running a model of this process. The faster this model runs, the more we will be able to understand about the process of vision.

The BlueGene science project is explained at the IBM Research Blue Gene Project Page, and the design of the functions is shown in the "Blue Gene" issue of the IBM Journal of Research and Development. alphaWorks completes the work by making the functions themselves available.

The functions are expressed as C++ source code. They are intended to be included as header files in an application that needs them; an optimizing compiler should then be able to achieve the best performance available for the application. The IBM science application uses the IBM xlC compiler to good effect.

It is possible to translate the functions from C++ to Fortran, C, Java, or other languages to support applications written in those languages, and it may be possible to use them on hardware architectures other than the IBM "440d".

About the technology author(s)

Chris Ward

Chris Ward joined IBM in 1982 as a graduate engineer from Cambridge University, England. From 2001 to 2003, he was assigned to IBM Research, Yorktown Heights, where he worked as part of the team responsible for bringing the IBM eServer BlueGene from concept to production use. Mr. Ward currently works as an advisory software engineer for IBM Software Group Application Integration Middleware in Hursley, England.

IBM supercomputing services and IBM WebSphere are both capable of supporting commercial clients, in their different ways, and both are most valuable when they are backed by IBM's integration services capability.
