IMSL C Stat Library
Programming Notes for Using NVIDIA® CUDA™ Toolkit
This reference material is intended for users who want to use the computational resources of their NVIDIA GPU board for numeric processing when using the IMSL C Numerical Library. Users who do not have the NVIDIA GPU board can ignore this section.
Rationale and General Algorithm
NVIDIA® CUDA™ technology leverages the massively parallel processing power of NVIDIA GPUs. The NVIDIA CUDA Toolkit provides functions which can be used as building blocks for an application taking advantage of this technology. IMSL C Numerical Library has incorporated the use of some of these functions to improve the overall performance of the library.
No direct use or knowledge of the NVIDIA CUDA Toolkit is required to take advantage of these functions. The program or application is simply rebuilt using environment variables which link with the NVIDIA CUDA Toolkit libraries.
The strategy for using the NVIDIA GPU is given by the following algorithm:
*If an NVIDIA-enabled version of an IMSL function is called and the largest vector or matrix dimension is greater than or equal to a threshold value, then:
    *Copy the required vector and matrix data from the CPU to the GPU.
    *Compute the result on the GPU.
    *Copy the result from the GPU to the CPU.
*Else, use the equivalent IMSL version of the function that does not use the GPU.
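The dispatch strategy above can be sketched in plain C. This is only an illustration under stated assumptions, not the library's internal code: the names gpu_threshold, last_call_used_gpu, gemv_cpu, gemv_gpu_stub, and gemv_dispatch are hypothetical, the GPU path is a stand-in that reuses the CPU code, and treating a threshold of zero as "GPU disabled" is an assumption inferred from the note on double precision later in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold; the text gives 32 as the default value. */
int gpu_threshold = 32;

/* Records which path the last call took (illustration only). */
int last_call_used_gpu = 0;

/* CPU path: y = A*x for a row-major m-by-n matrix A. */
void gemv_cpu(int m, int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[(size_t)i * (size_t)n + (size_t)j] * x[j];
        y[i] = sum;
    }
}

/* Stand-in for the GPU path. A real implementation would copy a and x
   to the device, compute there, and copy y back. The CPU code is reused
   here so that the sketch stays self-contained. */
void gemv_gpu_stub(int m, int n, const double *a, const double *x, double *y)
{
    gemv_cpu(m, n, a, x, y);
}

/* Chooses a path from the largest dimension and the threshold.
   Assumption: a threshold of zero (or less) disables the GPU path. */
void gemv_dispatch(int m, int n, const double *a, const double *x, double *y)
{
    assert(m >= 0 && n >= 0);
    int max_dim = (m > n) ? m : n;
    if (gpu_threshold > 0 && max_dim >= gpu_threshold) {
        last_call_used_gpu = 1;
        gemv_gpu_stub(m, n, a, x, y);
    } else {
        last_call_used_gpu = 0;
        gemv_cpu(m, n, a, x, y);
    }
}
```

With the default threshold, a 2-by-2 product takes the CPU path. Lowering gpu_threshold to 2 sends the same call down the GPU stand-in, and the caller sees the same interface either way.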
Normally, code that calls an IMSL/NVIDIA function does not need to be aware of the copy steps or the threshold size; these are hidden from the user code. Users can, however, change the threshold size. This matters because the GPU version may be slower than the CPU version until array sizes become sufficiently large; beyond that point the GPU version is typically faster, and its advantage grows as the problem size increases. The default threshold value is 32, which may not be optimal but allows the functions to perform correctly without initial attention to this value.
The user can change the threshold value for all or specific IMSL/NVIDIA functions by using the IMSL function imsls_cuda_set. The threshold values can be obtained using the IMSL function imsls_cuda_get.
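The idea of one threshold per function, changeable for all functions or for a specific one, can be modeled with a small table. The sketch below is hypothetical throughout: the demo_blas enumeration, the DEMO_ALL selector, and the functions demo_threshold_set and demo_threshold_get are invented for illustration and are not the signatures of imsls_cuda_set or imsls_cuda_get, which are documented in their own function descriptions.

```c
#include <assert.h>

/* Hypothetical identifiers for this sketch; not the library's enumeration.
   DEMO_COUNT is the number of real entries; DEMO_ALL selects every entry. */
enum demo_blas { DEMO_SGEMV, DEMO_DGEMV, DEMO_SGEMM, DEMO_DGEMM,
                 DEMO_COUNT, DEMO_ALL };

/* One threshold per function, each starting at the documented default of 32. */
int demo_thresholds[DEMO_COUNT] = {32, 32, 32, 32};

/* Sets the threshold for one function, or for all of them with DEMO_ALL. */
void demo_threshold_set(enum demo_blas which, int value)
{
    assert(value >= 0);
    if (which == DEMO_ALL) {
        for (int i = 0; i < DEMO_COUNT; i++)
            demo_thresholds[i] = value;
    } else {
        assert(which < DEMO_COUNT);
        demo_thresholds[which] = value;
    }
}

/* Returns the current threshold for one function. */
int demo_threshold_get(enum demo_blas which)
{
    assert(which < DEMO_COUNT);
    return demo_thresholds[which];
}
```

A caller tuning a single routine would raise only that entry, for example demo_threshold_set(DEMO_SGEMM, 512), and leave the others at their defaults.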
The floating-point results obtained using the CPU and the GPU will likely differ in the low-order bits of each component. These differences arise because the floating-point arithmetic strategies and rounding modes implemented on the NVIDIA board are not equivalent to those of the CPU. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally, either result should be acceptable for numerical work.
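Because the two paths can disagree in the low-order bits, a regression test that compares CPU and GPU output bit for bit is likely to fail spuriously. One common alternative, sketched here, is a component-wise comparison with a relative tolerance. The function name vectors_agree and the tolerance values are illustrative choices, not part of the library.

```c
#include <stddef.h>

/* Small helpers, written out so the sketch needs no math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }
static double max_d(double u, double v) { return u > v ? u : v; }

/* Returns 1 if every component of a and b agrees to within rel_tol,
   relative to the larger magnitude of the pair, with abs_tol as a floor
   for components near zero. Returns 0 otherwise, including when a
   difference is NaN. */
int vectors_agree(const double *a, const double *b, size_t n,
                  double rel_tol, double abs_tol)
{
    for (size_t i = 0; i < n; i++) {
        double diff = abs_d(a[i] - b[i]);
        double scale = max_d(abs_d(a[i]), abs_d(b[i]));
        if (!(diff <= max_d(abs_tol, rel_tol * scale)))
            return 0;
    }
    return 1;
}
```

A tolerance a few orders of magnitude above machine precision (for example 1e-12 in double precision) accepts rounding-level differences while still catching genuine regressions. The right value depends on the conditioning of the problem.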
Implementation
Basic Linear Algebra Subprograms
IMSL C Numerical Library incorporates the use of many Basic Linear Algebra Subprograms (BLAS) throughout the product. These functions are named using IMSL conventions and used internally. They are not accessible directly by the user.
NVIDIA Corp. implemented certain Level 1, 2 and 3 BLAS in the NVIDIA CUDA Toolkit. The NVIDIA external names and argument protocols are different from those used by the IMSL C Numerical Library. Wrappers have been written to allow for the IMSL C Numerical Library to access selected routines in the NVIDIA CUDA Toolkit.
In Table 16, we document an enumeration that includes those BLAS for which a CUDA Toolkit implementation is provided in the IMSL C Numerical Library. The naming convention used is the name of the BLAS function prefaced by ‘IMSLS_CUDA_’.
Utility Functions
There are three utility functions provided in the IMSL C Stat Library that can be used to help manage the use of NVIDIA CUDA Toolkit. These utilities appear in Table 17 and are described in more detail in their corresponding function descriptions.
Note: Some NVIDIA hardware does not have working double precision versions of BLAS because no double precision arithmetic is available on the device. However, the double precision code itself is part of the NVIDIA CUDA Toolkit library, so it will appear to execute even though it does not give correct results on such a device. When the IMSL software detects that correct results are not returned, a warning error message is printed and the IMSL equivalent of the function that does not use the GPU is used instead. The user can eliminate this warning by using the function imsls_cuda_set to set the threshold value to zero.
Table 16 – Enumeration of BLAS
IMSLS_CUDA_SGEMV
IMSLS_CUDA_DGBMV
IMSLS_CUDA_SGER
IMSLS_CUDA_DGER
IMSLS_CUDA_SSYR
IMSLS_CUDA_DSYR
IMSLS_CUDA_SGEMM
IMSLS_CUDA_DGEMM
Table 17 – NVIDIA CUDA Toolkit Utilities
Required NVIDIA Copyright Notice:
© 2005–2011 by NVIDIA Corporation. All rights reserved.
Portions of the NVIDIA SGEMM and DGEMM library routines were written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows:
Copyright (©) 2007-09, Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (See CUDA Toolkit 4.0, CUBLAS Library, April, 2011, for these remaining conditions.)