Skip to main content

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units

A sparse matrix-vector multiplication library optimized for NVIDIA GPUs using CUDA

Date Posted: January 28, 2009

alphaworks tab navigation


 

Update: April 21, 2009
The source file is now available for download.

 

What is Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units?

Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs).

The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogenous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. The state-of-the art GPUs exhibit very high computational capabilities at a reasonable price.

These GPUs also support high-level parallel programming models, for example, NVIDIA's Common Unified Device Architecture (CUDA) or Brook+ from AMD, that enable users to develop parallel applications that use the CPU as the host and the GPU as an accelerator.

Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing.

SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA CUDA interfaces, and works on all NVIDIA GPUs that support this library. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrate that the SpMV library provides two to four times improvement over the equivalent solution provided by the NVIDIA's CUDPP library.

While the current implementation of the SpMV code uses the CUDA interfaces, the code can be easily migrated to use the upcoming OpenCL standard. This will allow the SpMV code to execute on a wide range of GPU architectures.

How does it work?

The SpMV4GPU package is provided as a library in a binary format, currently for x86/Linux. The application invokes the library and links it using the CUDA runtime system provided by NVIDIA. The package supports all versions of CUDA above version 2.0.

Our CUDA implementation of the SpMV kernel employs both compile-time and run-time optimizations. The compile-time optimizations include:

  1. Exploiting synchronization-free parallelism
  2. Optimized thread mapping based on the affinity towards optimal memory access pattern
  3. Optimal off-chip memory access to tolerate the high latency
  4. Exploiting data reuse.

The runtime optimizations involve a runtime inspection of the sparse matrix to determine dense non-zero sub-blocks, which facilitate the reuse of input vector elements during execution. Internally, the library uses a new blocked storage format for storing and accessing elements of a sparse matrix in an optimized manner from the GPU memories. These optimizations result in performance improvement by a factor of two to four over the NVIDIA CUDPP library.

Our library interfaces shield users from the intricacies of programming the GPU system. For example, the user is not required to specify how the array would be partitioned, how the threads would be mapped or how the arrays are laid out in the GPU memory. The library interfaces enable users to feed in the sparse matrix using the traditional compressed column or row format. In addition, the users can provide additional information about the sparse matrix structure that can be used by the library for fine-tuning the optimization strategies.

Sparse Matrix-Vector multiplication (SpMV) is one of the most important kernels in scientific computing. Learn more about Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-time and Run-time Strategies.

About the technology author(s)

Rajesh Bordawekar is a Research Staff Member in the Programming Technologies Department at the IBM T. J. Watson Research Center. He received his B.E. in Electrical Engineering from University of Mumbai, and M.S. and Ph.D. degrees in Computer Engineering from Syracuse University. He was a post-doctoral fellow at the Center for Advanced Computing Research, California Institute of Technology, from 1996 to 1998. His research interests include XML processing, Java memory management, compiler optimizations, optimizing non-HPC applications for modern multi-core architectures, and concurrent search algorithms.

Muthu Manikandan Baskaran is currently a Ph.D. candidate, advised by Dr. P. Sadayappan, in the Department of Computer Science and Engineering at Ohio State University. He received his BE degree in Computer Science and Engineering from the National Institute of Technology, Trichy, India, in 2002. He was a Research Intern at the IBM T.J. Watson Research Center from June, 2008 until December, 2008. His research interests include compiler optimizations and techniques, automatic parallelization and optimizations for modern parallel architectures, and parallel programming models. He is currently involved in designing and exploring automatic compiler optimizations for modern many-core architectures like GPUs, and in developing techniques for automatic parallelization on modern parallel architectures.

Trademarks




Related technologies