IMSL Fortran Math Library
Programming Notes for BLAS Using NVIDIA
This reference material is intended for users who want to use the computational resources of an NVIDIA GPU board for the BLAS. Users who do not have an NVIDIA GPU board can ignore this section.
Rationale, General Algorithm and an Example
NVIDIA Corp. has implemented certain Level 1, 2, and 3 BLAS in its CUDA CUBLAS Library, V3.1, July 2010. The NVIDIA external names and argument protocols differ from the equivalent Fortran names and argument addressing. See Table 9.2 for the names marked in the color GREEN. IMSL has written these marked Fortran BLAS so that they call the equivalent NVIDIA C language codes in the CUBLAS library; no direct use or knowledge of C is required for a Fortran programmer to take advantage of these codes. The user code or application package must be compiled with a Fortran 2003 compiler that implements the C Interoperability Standard feature (see The Fortran 2003 Handbook, Adams, et al., p. 561). IMSL's use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS. The program or application is then compiled and linked using the IMSL and NVIDIA libraries that contain these BLAS.
Note: An NVIDIA Graphics Processing Unit (GPU) is required to take advantage of the BLAS.
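For background, the C Interoperability feature lets a Fortran interface bind directly to a C entry point in the CUBLAS library. The following is a minimal sketch of such a binding for the CUBLAS V3.1 routine cublasSaxpy(); it is illustrative only and is not necessarily the interface IMSL uses internally.
! Illustrative sketch: a Fortran 2003 interface bound to the CUBLAS V3.1
! C entry point cublasSaxpy().  devX and devY are GPU addresses, for
! example obtained from cublasAlloc() (see Table 9.8).
INTERFACE
   SUBROUTINE cublasSaxpy(n, alpha, devX, incx, devY, incy) &
        BIND(C, NAME='cublasSaxpy')
     USE, INTRINSIC :: ISO_C_BINDING
     INTEGER(C_INT), VALUE :: n, incx, incy
     REAL(C_FLOAT), VALUE :: alpha
     TYPE(C_PTR), VALUE :: devX, devY
   END SUBROUTINE cublasSaxpy
END INTERFACE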
The strategy for using the attached NVIDIA GPU is given by the following algorithm; a schematic code sketch follows the list:
*If the maximum of the vector or matrix dimensions is larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then
*Copy the required vector and matrix data from the CPU to the GPU
*Compute the result on the GPU
*Copy the result from the GPU to the CPU
*Else, use the IMSL equivalent version of the BLAS routine that does not use the GPU.
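In schematic Fortran terms, each IMSL/NVIDIA BLAS wrapper follows the pattern below. This is an illustrative sketch only, not IMSL's actual source; the names HAVE_CUBLAS_CODE, GPU_COPY_IN, GPU_COMPUTE, GPU_COPY_OUT, and FORTRAN_BLAS are hypothetical placeholders for the CUBLAS helper calls, the CUBLAS computational routine, and the IMSL CPU code.
! Schematic dispatch logic for one IMSL/NVIDIA BLAS wrapper; all
! subprogram names here are illustrative placeholders.
IF (MAX(M, N) > NSTART .AND. HAVE_CUBLAS_CODE) THEN
   CALL GPU_COPY_IN()    ! copy the required vector/matrix data CPU -> GPU
   CALL GPU_COMPUTE()    ! compute the result on the GPU
   CALL GPU_COPY_OUT()   ! copy the result GPU -> CPU
ELSE
   CALL FORTRAN_BLAS()   ! IMSL CPU version of the routine
END IF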
Normally, a code that calls an IMSL/NVIDIA BLAS routine does not have to be aware of the copy steps or of the switchover size, NSTART; these are hidden from the user code. In the first algorithm step, a working block is allocated on the GPU for each array argument. A table within the IMSL module CUDABLAS_LIBRARY records the sizes and GPU addresses of these blocks. If the blocks are too small for the current problem size and data type, they are reallocated to an adequate size. The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS. The IMSL versions of the BLAS also allow a user to define an individual value of NSTART for each routine. This is important because using the GPU may be slower than using the CPU Fortran version until the switchover array size is reached; thereafter the GPU version is typically faster, and increasingly so as the problem size grows. The default value NSTART=32 is used for each vector/matrix argument of each routine, but it may not be optimal. This default allows the routines to function correctly without initial attention to this value.
The user can change the default switchover value for all IMSL/NVIDIA BLAS vector/matrix arguments by setting NSTART to the desired value before calling the BLAS routine. Additionally, users can reset this value for each individual vector/matrix argument of the routines listed in Table 9.2 and marked with the color GREEN by using the IMSL routine CUBLAS_SET(…). Note that CUBLAS_SET cannot be used prior to an initial call to a BLAS code. The current switchover values can be obtained using the IMSL routine CUBLAS_GET(…).
The floating point results obtained using the CPU and the GPU will likely differ in the low-order bits of each component. These differences arise from non-equivalent floating point arithmetic strategies and rounding modes implemented on the NVIDIA board. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally, either result should be acceptable for numerical work.
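When comparing CPU and GPU results in a regression test, a relative tolerance scaled to the working precision is the usual remedy for these low-order-bit differences. The sketch below is one possible check, not an IMSL utility; the names Y_CPU, Y_GPU, and RELERR are local to the example, and the tolerance factor of 100 is an arbitrary illustration.
! Sketch of a tolerance-based comparison of CPU and GPU results.
! Y_CPU and Y_GPU are assumed to hold the same BLAS result computed
! both ways; a small multiple of EPSILON() absorbs low-order-bit
! differences due to GPU rounding modes.
INTEGER, PARAMETER :: N = 1000
REAL :: Y_CPU(N), Y_GPU(N), RELERR
RELERR = MAXVAL(ABS(Y_GPU - Y_CPU)) / MAX(MAXVAL(ABS(Y_CPU)), TINY(1.E0))
IF (RELERR > 100.E0*EPSILON(1.E0)) THEN
   PRINT *, 'CPU/GPU difference exceeds expected rounding: ', RELERR
END IF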
As an added feature, the user can flag that the data values for a vector or matrix are already present on the GPU and thereby suppress the IMSL/NVIDIA BLAS code's initial copy of the data. This is often important, since data movement from the CPU to the GPU may be a significant part of the computation time. If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called. Whether copying is needed for each use of a BLAS code depends on the application. Valid results are always copied back from the GPU to the CPU memory. The indication that the data for a positional array argument requires no initial copy step is a negative switchover value for that argument; the absolute value is used as the switchover value. Caution: be sure to reset this to a positive value when the argument again requires an initial copy step.
Table 9.3 through Table 9.5 list an enumeration that includes the routines marked with the color GREEN in Table 9.2. Each enumerated name is the BLAS routine name prefixed with the string 'CUDABLAS_'. Some enumerated names currently do not use the NVIDIA hardware; they are included in anticipation of future additions that will use the CUBLAS library.
There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities appear in Table 9.7 and are described in more detail in the routines section of these notes.
For example, to set the value to 500 at which the GPU is first used for the first positional array argument, A(*,*) (i.e., Array_Arg = 1), of the Level-2 routine 'SGEMV', execute the code:
USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
ISWITCH=500
Array_Arg = 1
! Switch to using the GPU when the largest size of A(*,*) > 500.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, ISWITCH)
When the positional array argument A(*,*) does not have to be copied for each subsequent use of 'SGEMV', flag this as follows:
USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
Array_Arg = 1
ISWITCH=CUBLAS_GET(CUDABLAS_SGEMV, Array_Arg)
! Avoid copying data from CPU to GPU for subsequent calls to ‘SGEMV’
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, -abs(ISWITCH))
! Make several calls to 'SGEMV' with A(*,*) maintained unchanged on the GPU.
! Reset flag for copying A(*,*) when this matrix-vector product sequence is completed.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, abs(ISWITCH))
Some NVIDIA hardware does not have working double precision versions of the BLAS because no double precision arithmetic is available on the device. However, the double precision code itself is part of the CUDA CUBLAS library, and it will appear to execute even though it will not give correct results when the device has no double precision arithmetic. When the IMSL software detects that correct results are not returned, a warning error message is printed. The user may instruct the application to henceforth use the Fortran code by setting the switchover value to zero. For example, if it is known that the hardware does not support DOUBLE PRECISION, a code that calls 'DGEMM' will use an alternate version of this routine. Thus, ignoring the error message and continuing will compute the result with the alternate version. That code would include:
USE CUDABLAS_LIBRARY
! Flag first array argument A(*,*) to avoid use of the GPU for DGEMM:
CALL CUBLAS_SET(CUDABLAS_DGEMM,1,0)
If it is necessary to know whether the GPU or the CPU version of 'SGEMM' was used following a call to that code, the inquiry code would include:
USE CUDABLAS_LIBRARY
! Get the current status for the last call to SGEMM with the INTEGER function
! CUBLAS_GET. The value ISWITCH=0 if an alternate was used, and ISWITCH=1 if the
! GPU was used.
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMM, 4)
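Putting these pieces together, the following sketch sets a switchover value for 'SGEMM', performs a matrix product, and then asks whether the GPU was used. The problem size and the switchover value of 500 are arbitrary illustrations; 'SGEMM' has the standard BLAS argument list.
USE CUDABLAS_LIBRARY
INTEGER, PARAMETER :: N = 1000
REAL :: A(N,N), B(N,N), C(N,N)
INTEGER ISWITCH
A = 1.E0; B = 2.E0; C = 0.E0
! Use the GPU for the first array argument, A(*,*), once its largest
! dimension exceeds 500.
CALL CUBLAS_SET(CUDABLAS_SGEMM, 1, 500)
! C <- 1.0*A*B + 0.0*C, computed on the GPU when N > 500.
CALL SGEMM('N', 'N', N, N, N, 1.E0, A, N, B, N, 0.E0, C, N)
! ISWITCH=1 if the GPU was used; ISWITCH=0 if the CPU alternate was used.
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMM, 4)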
Enumeration of IMSL/NVIDIA BLAS
Table 9.3 — Enumeration of Level-1 BLAS
CUDABLAS_SROTG     CUDABLAS_DROTG     CUDABLAS_CROTG     CUDABLAS_ZROTG
CUDABLAS_SROTMG    CUDABLAS_DROTMG
CUDABLAS_SROT      CUDABLAS_DROT      CUDABLAS_CROT      CUDABLAS_ZROT
CUDABLAS_SROTM     CUDABLAS_DROTM     CUDABLAS_CSROT     CUDABLAS_ZSROT
CUDABLAS_SSWAP     CUDABLAS_DSWAP     CUDABLAS_CSWAP     CUDABLAS_ZSWAP
CUDABLAS_SCOPY     CUDABLAS_DCOPY     CUDABLAS_CCOPY     CUDABLAS_ZCOPY
CUDABLAS_SAXPY     CUDABLAS_DAXPY     CUDABLAS_CAXPY     CUDABLAS_ZAXPY
CUDABLAS_SDOT      CUDABLAS_DDOT      CUDABLAS_CDOTC     CUDABLAS_ZDOTC
CUDABLAS_SDSDOT    CUDABLAS_DSDOT     CUDABLAS_CDOTU     CUDABLAS_ZDOTU
CUDABLAS_SSCAL     CUDABLAS_DSCAL     CUDABLAS_CSCAL     CUDABLAS_ZSCAL
                                      CUDABLAS_CSSCAL    CUDABLAS_ZSSCAL
CUDABLAS_SNRM2     CUDABLAS_DNRM2     CUDABLAS_SCNRM2    CUDABLAS_DZNRM2
CUDABLAS_SASUM     CUDABLAS_DASUM     CUDABLAS_SCASUM    CUDABLAS_DZASUM
CUDABLAS_ISAMIN    CUDABLAS_IDAMIN    CUDABLAS_ICAMIN    CUDABLAS_IZAMIN
CUDABLAS_ISAMAX    CUDABLAS_IDAMAX    CUDABLAS_ICAMAX    CUDABLAS_IZAMAX
Table 9.4 — Enumeration of Level-2 BLAS
CUDABLAS_SGEMV     CUDABLAS_DGEMV     CUDABLAS_CGEMV     CUDABLAS_ZGEMV
CUDABLAS_SGBMV     CUDABLAS_DGBMV     CUDABLAS_CGBMV     CUDABLAS_ZGBMV
CUDABLAS_SSYMV     CUDABLAS_DSYMV     CUDABLAS_CHEMV     CUDABLAS_ZHEMV
CUDABLAS_SSBMV     CUDABLAS_DSBMV     CUDABLAS_CHBMV     CUDABLAS_ZHBMV
CUDABLAS_SSPMV     CUDABLAS_DSPMV     CUDABLAS_CHPMV     CUDABLAS_ZHPMV
CUDABLAS_STRMV     CUDABLAS_DTRMV     CUDABLAS_CTRMV     CUDABLAS_ZTRMV
CUDABLAS_STBMV     CUDABLAS_DTBMV     CUDABLAS_CTBMV     CUDABLAS_ZTBMV
CUDABLAS_STPMV     CUDABLAS_DTPMV     CUDABLAS_CTPMV     CUDABLAS_ZTPMV
CUDABLAS_STRSV     CUDABLAS_DTRSV     CUDABLAS_CTRSV     CUDABLAS_ZTRSV
CUDABLAS_STBSV     CUDABLAS_DTBSV     CUDABLAS_CTBSV     CUDABLAS_ZTBSV
CUDABLAS_STPSV     CUDABLAS_DTPSV     CUDABLAS_CTPSV     CUDABLAS_ZTPSV
CUDABLAS_SGER      CUDABLAS_DGER      CUDABLAS_CGERU     CUDABLAS_ZGERU
                                      CUDABLAS_CGERC     CUDABLAS_ZGERC
CUDABLAS_SSYR      CUDABLAS_DSYR      CUDABLAS_CHER      CUDABLAS_ZHER
CUDABLAS_SSYR2     CUDABLAS_DSYR2     CUDABLAS_CHER2     CUDABLAS_ZHER2
CUDABLAS_SSPR      CUDABLAS_DSPR      CUDABLAS_CHPR      CUDABLAS_ZHPR
CUDABLAS_SSPR2     CUDABLAS_DSPR2     CUDABLAS_CHPR2     CUDABLAS_ZHPR2
Table 9.5 — Enumeration of Level-3 BLAS
CUDABLAS_SGEMM     CUDABLAS_DGEMM     CUDABLAS_CGEMM     CUDABLAS_ZGEMM
CUDABLAS_SSYMM     CUDABLAS_DSYMM     CUDABLAS_CSYMM     CUDABLAS_ZSYMM
CUDABLAS_SSYRK     CUDABLAS_DSYRK     CUDABLAS_CSYRK     CUDABLAS_ZSYRK
CUDABLAS_SSYR2K    CUDABLAS_DSYR2K    CUDABLAS_CSYR2K    CUDABLAS_ZSYR2K
CUDABLAS_STRMM     CUDABLAS_DTRMM     CUDABLAS_CTRMM     CUDABLAS_ZTRMM
CUDABLAS_STRSM     CUDABLAS_DTRSM     CUDABLAS_CTRSM     CUDABLAS_ZTRSM
                                      CUDABLAS_CHEMM     CUDABLAS_ZHEMM
                                      CUDABLAS_CHERK     CUDABLAS_ZHERK
                                      CUDABLAS_CHER2K    CUDABLAS_ZHER2K
Table 9.6 — Public Symbols and Parameters in Module CUDABLAS_LIBRARY
CUBLAS_STATUS_SUCCESS=0
CUBLAS_STATUS_NOT_INITIALIZED=1
CUBLAS_STATUS_ALLOC_FAILED=3
CUBLAS_STATUS_INVALID_VALUE=7
CUBLAS_STATUS_ARCH_MISMATCH=8
CUBLAS_STATUS_MAPPING_ERROR=11
CUBLAS_STATUS_EXECUTION_FAILED=13
CUBLAS_STATUS_INTERNAL_ERROR=14
FSIZE=4
DSIZE=8
CSIZE=8
ZSIZE=16
SKIND=kind(1.E0)
DKIND=kind(1.D0)
SZERO=0.E0
DZERO=0.D0
SONE=1.E0
DONE=1.D0
LEVEL=6 (IMSL Error or Warning Level)
NSTART=32 (Default Switchover Value)
Table 9.7 — Subprograms Packaged in Module CUDABLAS_LIBRARY
Fortran Name Implemented in Module
Table 9.8 lists a number of NVIDIA helper subprograms called within the CUDABLAS_LIBRARY module. These are mostly for internal use but are documented here in case a knowledgeable NVIDIA Library user chooses to make use of them.
Table 9.8 — NVIDIA Helper Subprograms Called in Module CUDABLAS_LIBRARY
Fortran Usage Name in Module                                NVIDIA External C Name
ISW = cublasInit()                                          cublasInit()
ISW = cublasShutdown()                                      cublasShutdown()
ISW = cublasError()                                         cublasError()
ISW = cublasAlloc(n, datasize, c_ptr)                       cublasAlloc(n, datasize, c_ptr)
ISW = cublasFree(c_ptr)                                     cublasFree(c_ptr)
ISW = cublasSetVector(n, datasize, x, incx, y, incy)        cublasSetVector(n, datasize, x, incx, y, incy)
ISW = cublasGetVector(n, datasize, x, incx, y, incy)        cublasGetVector(n, datasize, x, incx, y, incy)
ISW = cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)    cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)
ISW = cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)    cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)
In Table 9.8 the arguments c_ptr, x, y, A, devA, and B are C pointers to arrays either on the GPU or the CPU. These are instantiated with calls to the helper routine cublasAlloc() or, for array arguments residing on the CPU, by use of the Fortran 2003 intrinsic function c_loc(…), which returns a C pointer to a Fortran object. The helper function cublasError() is called from each of the double precision IMSL/NVIDIA BLAS codes to assess the availability of double precision floating point hardware on the GPU.
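As an illustration of the Table 9.8 helpers, the following sketch allocates a GPU working block, copies a CPU vector to it, and copies it back, checking one status value against CUBLAS_STATUS_SUCCESS from Table 9.6. The array X and the size N are local examples; note the TARGET attribute on X so that c_loc() may be applied to it.
! Sketch: move a vector to the GPU and back with the Table 9.8 helpers.
! FSIZE and CUBLAS_STATUS_SUCCESS are public symbols of CUDABLAS_LIBRARY.
USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
INTEGER, PARAMETER :: N = 1000
REAL, TARGET :: X(N)
TYPE(C_PTR) :: devX
INTEGER ISW
X = 1.E0
ISW = cublasInit()
! Allocate a GPU block for N single precision values (FSIZE bytes each).
ISW = cublasAlloc(N, FSIZE, devX)
IF (ISW /= CUBLAS_STATUS_SUCCESS) STOP 'cublasAlloc failed'
ISW = cublasSetVector(N, FSIZE, c_loc(X), 1, devX, 1)   ! CPU -> GPU
ISW = cublasGetVector(N, FSIZE, devX, 1, c_loc(X), 1)   ! GPU -> CPU
ISW = cublasFree(devX)
ISW = cublasShutdown()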
The NVIDIA Environmental Subprograms listed in Table 9.9 provide details about the runtime working environment.
Table 9.9 — NVIDIA Environmental Subprograms
Fortran Usage Name in Module                                     NVIDIA External Name
ISW = cudaGetDeviceCount(ICOUNT)                                 cudaGetDeviceCount()
ISW = cudaSetDevice(IDEVICE)  (IDEVICE is 0-indexed)             cudaSetDevice()
ISW = cudaGetDeviceProperties(<TYPE> cudaDeviceProp, IDEVICE)    cudaGetDeviceProperties()
One argument of cudaGetDeviceProperties is a Fortran derived type, cudaDeviceProp, with a C binding. It contains technical information about the device, including its name. The C character string NAME(*) is terminated with C_NULL_CHAR. The derived type cudaDeviceProp is described below:
TYPE, BIND(C) :: cudaDeviceProp
   CHARACTER(C_CHAR) NAME(256)               ! device name, C_NULL_CHAR terminated
   INTEGER(C_SIZE_T) totalGlobalMem          ! total global memory, in bytes
   INTEGER(C_SIZE_T) sharedMemPerBlock       ! shared memory per block, in bytes
   INTEGER(C_INT) regsPerBlock               ! 32-bit registers per block
   INTEGER(C_INT) warpSize                   ! threads per warp
   INTEGER(C_SIZE_T) memPitch                ! maximum pitch for memory copies, in bytes
   INTEGER(C_INT) maxThreadsPerBlock         ! maximum threads per block
   INTEGER(C_INT) maxThreadsDim(3)           ! maximum block dimensions
   INTEGER(C_INT) maxGridSize(3)             ! maximum grid dimensions
   INTEGER(C_SIZE_T) totalConstMem           ! constant memory, in bytes
   INTEGER(C_INT) major                      ! compute capability, major number
   INTEGER(C_INT) minor                      ! compute capability, minor number
   INTEGER(C_INT) clockRate                  ! clock rate, in kHz
   INTEGER(C_SIZE_T) textureAlignment        ! alignment requirement for textures
   INTEGER(C_INT) deviceOverlap              ! 1 if copy and execution can overlap
   INTEGER(C_INT) multiProcessorCount        ! number of multiprocessors
   INTEGER(C_INT) kernelExecTimeoutEnabled   ! 1 if kernels have a run-time limit
   INTEGER(C_INT) integrated                 ! 1 if an integrated (motherboard) GPU
   INTEGER(C_INT) canMapHostMemory           ! 1 if host memory can be mapped
   INTEGER(C_INT) computeMode                ! device compute mode
   INTEGER(C_INT) concurrentKernels          ! 1 if concurrent kernels are supported
END TYPE
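A short sketch of how the environmental subprograms might be used to list the available devices follows, assuming they and the cudaDeviceProp type are accessed from the CUDABLAS_LIBRARY module; only the loop bounds and the trimming of NAME(*) at C_NULL_CHAR are added beyond the Table 9.9 calls.
! Sketch: enumerate CUDA devices, printing each name and compute capability.
USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
TYPE(cudaDeviceProp) :: PROP
CHARACTER(LEN=256) :: DEVNAME
INTEGER ISW, ICOUNT, IDEVICE, K
ISW = cudaGetDeviceCount(ICOUNT)
DO IDEVICE = 0, ICOUNT-1                 ! devices are 0-indexed
   ISW = cudaGetDeviceProperties(PROP, IDEVICE)
   DEVNAME = ' '
   DO K = 1, 256                         ! copy NAME(*) up to C_NULL_CHAR
      IF (PROP%NAME(K) == C_NULL_CHAR) EXIT
      DEVNAME(K:K) = PROP%NAME(K)
   END DO
   PRINT *, 'Device ', IDEVICE, ': ', TRIM(DEVNAME), &
            ', compute capability ', PROP%major, '.', PROP%minor
END DO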
Required NVIDIA Copyright Notice:
© 2005–2014 by NVIDIA Corporation. All rights reserved.
Portions of the NVIDIA SGEMM and DGEMM library routines were written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows:
Copyright (©) 2007-09, Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (See the CUDA CUBLAS Library, Version 3.1, July 2010, for the remaining conditions.)