IMSL Fortran Math Library
Programming Notes for BLAS Using NVIDIA
This reference material is intended for users who want to use the computational resources of an NVIDIA GPU board for the BLAS. Users who do not have an NVIDIA GPU board can ignore this section.
Rationale, General Algorithm and an Example
NVIDIA Corp. has implemented certain Level 1, 2, and 3 BLAS in its CUDA CUBLAS Library, V3.1, July 2010. The NVIDIA external names and argument protocols differ from the equivalent Fortran names and argument addressing. See Table 9.2 for the names marked in the color GREEN. IMSL has written these marked Fortran BLAS so that they call the equivalent NVIDIA C language codes in the CUBLAS library; no direct use or knowledge of C is required for a Fortran programmer to take advantage of these codes. The user code or application package must be compiled with a Fortran 2003 compiler that implements the C Interoperability Standard feature (see The Fortran 2003 Handbook, Adams, et al., p. 561). IMSL's use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS. The program or application is then compiled and linked using the IMSL and NVIDIA libraries that contain these BLAS.
Note: An NVIDIA Graphics Processing Unit (GPU) is required to take advantage of the BLAS.
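For background, the C Interoperability feature lets a Fortran interface bind directly to a C entry point in the CUBLAS library. The following is a minimal sketch of such a binding for the CUBLAS V3.1 routine cublasSaxpy(); it is illustrative only and is not necessarily the interface IMSL uses internally.
! Illustrative sketch: a Fortran 2003 interface bound to the CUBLAS V3.1
! C entry point cublasSaxpy().  devX and devY are GPU addresses, for
! example obtained from cublasAlloc() (see Table 9.8).
INTERFACE
   SUBROUTINE cublasSaxpy(n, alpha, devX, incx, devY, incy) &
        BIND(C, NAME='cublasSaxpy')
     USE, INTRINSIC :: ISO_C_BINDING
     INTEGER(C_INT), VALUE :: n, incx, incy
     REAL(C_FLOAT), VALUE :: alpha
     TYPE(C_PTR), VALUE :: devX, devY
   END SUBROUTINE cublasSaxpy
END INTERFACE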
The strategy for using the attached NVIDIA GPU is given by the following algorithm; a schematic code sketch follows the list:
*If the maximum of the vector or matrix dimensions is larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then
*Copy the required vector and matrix data from the CPU to the GPU
*Compute the result on the GPU
*Copy the result from the GPU to the CPU
*Else, use the IMSL equivalent version of the BLAS routine that does not use the GPU.
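In schematic Fortran terms, each IMSL/NVIDIA BLAS wrapper follows the pattern below. This is an illustrative sketch only, not IMSL's actual source; the names HAVE_CUBLAS_CODE, GPU_COPY_IN, GPU_COMPUTE, GPU_COPY_OUT, and FORTRAN_BLAS are hypothetical placeholders for the CUBLAS helper calls, the CUBLAS computational routine, and the IMSL CPU code.
! Schematic dispatch logic for one IMSL/NVIDIA BLAS wrapper; all
! subprogram names here are illustrative placeholders.
IF (MAX(M, N) > NSTART .AND. HAVE_CUBLAS_CODE) THEN
   CALL GPU_COPY_IN()    ! copy the required vector/matrix data CPU -> GPU
   CALL GPU_COMPUTE()    ! compute the result on the GPU
   CALL GPU_COPY_OUT()   ! copy the result GPU -> CPU
ELSE
   CALL FORTRAN_BLAS()   ! IMSL CPU version of the routine
END IF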
Normally, a code that calls an IMSL/NVIDIA BLAS routine does not have to be aware of the copy steps or of the switchover size, NSTART; these are hidden from the user code. In the first algorithm step, a working block is allocated on the GPU for each array argument. A table within the IMSL module CUDABLAS_LIBRARY records the sizes and GPU addresses of these blocks. If the blocks are too small for the current problem size and data type, they are reallocated to an adequate size. The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS. The IMSL versions of the BLAS also allow a user to define an individual value of NSTART for each routine. This is important because using the GPU may be slower than using the CPU Fortran version until the switchover array size is reached; thereafter the GPU version is typically faster, and increasingly so as the problem size grows. The default value NSTART=32 is used for each vector/matrix argument of each routine, but it may not be optimal. This default allows the routines to function correctly without initial attention to this value.
The user can change the default switchover value for all IMSL/NVIDIA BLAS vector/matrix arguments by setting NSTART to the desired value before calling the BLAS routine. Additionally, users can reset this value for each individual vector/matrix argument of the routines listed in Table 9.2 and marked with the color GREEN by using the IMSL routine CUBLAS_SET(…). Note that CUBLAS_SET cannot be used prior to an initial call to a BLAS code. The current switchover values can be obtained using the IMSL routine CUBLAS_GET(…).
The floating point results obtained using the CPU and the GPU will likely differ in the low-order bits of each component. These differences arise from non-equivalent floating point arithmetic strategies and rounding modes implemented on the NVIDIA board. This can be an important detail when comparing results for purposes of benchmarking or code regression. Generally, either result should be acceptable for numerical work.
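When comparing CPU and GPU results in a regression test, a relative tolerance scaled to the working precision is the usual remedy for these low-order-bit differences. The sketch below is one possible check, not an IMSL utility; the names Y_CPU, Y_GPU, and RELERR are local to the example, and the tolerance factor of 100 is an arbitrary illustration.
! Sketch of a tolerance-based comparison of CPU and GPU results.
! Y_CPU and Y_GPU are assumed to hold the same BLAS result computed
! both ways; a small multiple of EPSILON() absorbs low-order-bit
! differences due to GPU rounding modes.
INTEGER, PARAMETER :: N = 1000
REAL :: Y_CPU(N), Y_GPU(N), RELERR
RELERR = MAXVAL(ABS(Y_GPU - Y_CPU)) / MAX(MAXVAL(ABS(Y_CPU)), TINY(1.E0))
IF (RELERR > 100.E0*EPSILON(1.E0)) THEN
   PRINT *, 'CPU/GPU difference exceeds expected rounding: ', RELERR
END IF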
As an added feature, the user can flag that the data values for a vector or matrix are already present on the GPU and thereby suppress the IMSL/NVIDIA BLAS code's initial copy of the data. This is often important, since data movement from the CPU to the GPU may be a significant part of the computation time. If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called. Whether copying is needed for each use of a BLAS code depends on the application. Valid results are always copied back from the GPU to the CPU memory. The indication that the data for a positional array argument requires no initial copy step is a negative switchover value for that argument; the absolute value is used as the switchover value. Caution: be sure to reset this to a positive value when the argument again requires an initial copy step.
Table 9.3 through Table 9.5 list an enumeration that includes the routines marked with the color GREEN in Table 9.2. Each enumerated name is the BLAS routine name prefixed with the string 'CUDABLAS_'. Some enumerated names currently do not use the NVIDIA hardware; they are included in anticipation of future additions that will use the CUBLAS library.
There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities appear in Table 9.7 and are described in more detail in the routines section of these notes.
For example, to set the value to 500 at which the GPU is first used for the first positional array argument, A(*,*) (i.e., Array_Arg = 1), of the Level-2 routine 'SGEMV', execute the code:
USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
ISWITCH=500
Array_Arg = 1
! Switch to using the GPU when the largest size of A(*,*) > 500.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, ISWITCH)
When the positional array argument A(*,*) does not have to be copied for each subsequent use of 'SGEMV', flag this as follows:
USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
Array_Arg = 1
ISWITCH=CUBLAS_GET(CUDABLAS_SGEMV, Array_Arg)
! Avoid copying data from CPU to GPU for subsequent calls to ‘SGEMV’
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, -abs(ISWITCH))
! Make several calls to 'SGEMV' with A(*,*) maintained unchanged on the GPU.
! Reset flag for copying A(*,*) when this matrix-vector product sequence is completed.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, abs(ISWITCH))
Some NVIDIA hardware does not have working double precision versions of the BLAS because no double precision arithmetic is available on the device. However, the double precision code itself is part of the CUDA CUBLAS library, and it will appear to execute even though it will not give correct results when the device has no double precision arithmetic. When the IMSL software detects that correct results are not returned, a warning error message is printed. The user may instruct the application to henceforth use the Fortran code by setting the switchover value to zero. For example, if it is known that the hardware does not support DOUBLE PRECISION, a code that calls 'DGEMM' will use an alternate version of this routine. Thus, ignoring the error message and continuing will compute the result with the alternate version. That code would include:
USE CUDABLAS_LIBRARY
! Flag first array argument A(*,*) to avoid use of the GPU for DGEMM:
CALL CUBLAS_SET(CUDABLAS_DGEMM,1,0)
If it is necessary to know whether the GPU or the CPU version of 'SGEMM' was used following a call to that code, the inquiry code would include:
USE CUDABLAS_LIBRARY
! Get the current status for the last call to SGEMM with the INTEGER function
! CUBLAS_GET. The value ISWITCH=0 if an alternate was used, and ISWITCH=1 if the
! GPU was used.
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMM, 4)
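Putting these pieces together, the following sketch sets a switchover value for 'SGEMM', performs a matrix product, and then asks whether the GPU was used. The problem size and the switchover value of 500 are arbitrary illustrations; 'SGEMM' has the standard BLAS argument list.
USE CUDABLAS_LIBRARY
INTEGER, PARAMETER :: N = 1000
REAL :: A(N,N), B(N,N), C(N,N)
INTEGER ISWITCH
A = 1.E0; B = 2.E0; C = 0.E0
! Use the GPU for the first array argument, A(*,*), once its largest
! dimension exceeds 500.
CALL CUBLAS_SET(CUDABLAS_SGEMM, 1, 500)
! C <- 1.0*A*B + 0.0*C, computed on the GPU when N > 500.
CALL SGEMM('N', 'N', N, N, N, 1.E0, A, N, B, N, 0.E0, C, N)
! ISWITCH=1 if the GPU was used; ISWITCH=0 if the CPU alternate was used.
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMM, 4)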
Enumeration of IMSL/NVIDIA BLAS
Table 9.3 — Enumeration of Level-1 BLAS
CUDABLAS_SROTG     CUDABLAS_DROTG     CUDABLAS_CROTG     CUDABLAS_ZROTG
CUDABLAS_SROTMG    CUDABLAS_DROTMG
CUDABLAS_SROT      CUDABLAS_DROT      CUDABLAS_CROT      CUDABLAS_ZROT
CUDABLAS_SROTM     CUDABLAS_DROTM     CUDABLAS_CSROT     CUDABLAS_ZSROT
CUDABLAS_SSWAP     CUDABLAS_DSWAP     CUDABLAS_CSWAP     CUDABLAS_ZSWAP
CUDABLAS_SCOPY     CUDABLAS_DCOPY     CUDABLAS_CCOPY     CUDABLAS_ZCOPY
CUDABLAS_SAXPY     CUDABLAS_DAXPY     CUDABLAS_CAXPY     CUDABLAS_ZAXPY
CUDABLAS_SDOT      CUDABLAS_DDOT      CUDABLAS_CDOTC     CUDABLAS_ZDOTC
CUDABLAS_SDSDOT    CUDABLAS_DSDOT     CUDABLAS_CDOTU     CUDABLAS_ZDOTU
CUDABLAS_SSCAL     CUDABLAS_DSCAL     CUDABLAS_CSCAL     CUDABLAS_ZSCAL
                                      CUDABLAS_CSSCAL    CUDABLAS_ZSSCAL
CUDABLAS_SNRM2     CUDABLAS_DNRM2     CUDABLAS_SCNRM2    CUDABLAS_DZNRM2
CUDABLAS_SASUM     CUDABLAS_DASUM     CUDABLAS_SCASUM    CUDABLAS_DZASUM
CUDABLAS_ISAMIN    CUDABLAS_IDAMIN    CUDABLAS_ICAMIN    CUDABLAS_IZAMIN
CUDABLAS_ISAMAX    CUDABLAS_IDAMAX    CUDABLAS_ICAMAX    CUDABLAS_IZAMAX
Table 9.4 — Enumeration of Level-2 BLAS
CUDABLAS_SGEMV     CUDABLAS_DGEMV     CUDABLAS_CGEMV     CUDABLAS_ZGEMV
CUDABLAS_SGBMV     CUDABLAS_DGBMV     CUDABLAS_CGBMV     CUDABLAS_ZGBMV
CUDABLAS_SSYMV     CUDABLAS_DSYMV     CUDABLAS_CHEMV     CUDABLAS_ZHEMV
CUDABLAS_SSBMV     CUDABLAS_DSBMV     CUDABLAS_CHBMV     CUDABLAS_ZHBMV
CUDABLAS_SSPMV     CUDABLAS_DSPMV     CUDABLAS_CHPMV     CUDABLAS_ZHPMV
CUDABLAS_STRMV     CUDABLAS_DTRMV     CUDABLAS_CTRMV     CUDABLAS_ZTRMV
CUDABLAS_STBMV     CUDABLAS_DTBMV     CUDABLAS_CTBMV     CUDABLAS_ZTBMV
CUDABLAS_STPMV     CUDABLAS_DTPMV     CUDABLAS_CTPMV     CUDABLAS_ZTPMV
CUDABLAS_STRSV     CUDABLAS_DTRSV     CUDABLAS_CTRSV     CUDABLAS_ZTRSV
CUDABLAS_STBSV     CUDABLAS_DTBSV     CUDABLAS_CTBSV     CUDABLAS_ZTBSV
CUDABLAS_STPSV     CUDABLAS_DTPSV     CUDABLAS_CTPSV     CUDABLAS_ZTPSV
CUDABLAS_SGER      CUDABLAS_DGER      CUDABLAS_CGERU     CUDABLAS_ZGERU
                                      CUDABLAS_CGERC     CUDABLAS_ZGERC
CUDABLAS_SSYR      CUDABLAS_DSYR      CUDABLAS_CHER      CUDABLAS_ZHER
CUDABLAS_SSYR2     CUDABLAS_DSYR2     CUDABLAS_CHER2     CUDABLAS_ZHER2
CUDABLAS_SSPR      CUDABLAS_DSPR      CUDABLAS_CHPR      CUDABLAS_ZHPR
CUDABLAS_SSPR2     CUDABLAS_DSPR2     CUDABLAS_CHPR2     CUDABLAS_ZHPR2
Table 9.5 — Enumeration of Level-3 BLAS
CUDABLAS_SGEMM     CUDABLAS_DGEMM     CUDABLAS_CGEMM     CUDABLAS_ZGEMM
CUDABLAS_SSYMM     CUDABLAS_DSYMM     CUDABLAS_CSYMM     CUDABLAS_ZSYMM
CUDABLAS_SSYRK     CUDABLAS_DSYRK     CUDABLAS_CSYRK     CUDABLAS_ZSYRK
CUDABLAS_SSYR2K    CUDABLAS_DSYR2K    CUDABLAS_CSYR2K    CUDABLAS_ZSYR2K
CUDABLAS_STRMM     CUDABLAS_DTRMM     CUDABLAS_CTRMM     CUDABLAS_ZTRMM
CUDABLAS_STRSM     CUDABLAS_DTRSM     CUDABLAS_CTRSM     CUDABLAS_ZTRSM
                                      CUDABLAS_CHEMM     CUDABLAS_ZHEMM
                                      CUDABLAS_CHERK     CUDABLAS_ZHERK
                                      CUDABLAS_CHER2K    CUDABLAS_ZHER2K
Table 9.6 — Public Symbols and Parameters in Module CUDABLAS_LIBRARY
CUBLAS_STATUS_SUCCESS=0
CUBLAS_STATUS_NOT_INITIALIZED=1
CUBLAS_STATUS_ALLOC_FAILED=3
CUBLAS_STATUS_INVALID_VALUE=7
CUBLAS_STATUS_ARCH_MISMATCH=8
CUBLAS_STATUS_MAPPING_ERROR=11
CUBLAS_STATUS_EXECUTION_FAILED=13
CUBLAS_STATUS_INTERNAL_ERROR=14
FSIZE=4
DSIZE=8
CSIZE=8
ZSIZE=16
SKIND=kind(1.E0)
DKIND=kind(1.D0)
SZERO=0.E0
DZERO=0.D0
SONE=1.E0
DONE=1.D0
LEVEL=6 (IMSL Error or Warning Level)
NSTART=32 (Default Switchover Value)
Table 9.7 — Subprograms Packaged in Module CUDABLAS_LIBRARY
Fortran Name Implemented in Module
Table 9.8 lists a number of NVIDIA helper subprograms called within the CUDABLAS_LIBRARY module. These are mostly for internal use but are documented here in case a knowledgeable NVIDIA Library user chooses to make use of them.
Table 9.8 — NVIDIA Helper Subprograms Called in Module CUDABLAS_LIBRARY
Fortran Usage Name in Module                                NVIDIA External C Name
ISW = cublasInit()                                          cublasInit()
ISW = cublasShutdown()                                      cublasShutdown()
ISW = cublasError()                                         cublasError()
ISW = cublasAlloc(n, datasize, c_ptr)                       cublasAlloc(n, datasize, c_ptr)
ISW = cublasFree(c_ptr)                                     cublasFree(c_ptr)
ISW = cublasSetVector(n, datasize, x, incx, y, incy)        cublasSetVector(n, datasize, x, incx, y, incy)
ISW = cublasGetVector(n, datasize, x, incx, y, incy)        cublasGetVector(n, datasize, x, incx, y, incy)
ISW = cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)    cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)
ISW = cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)    cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)
In Table 9.8 the arguments c_ptr, x, y, A, devA, and B are C pointers to arrays either on the GPU or the CPU. These are instantiated with calls to the helper routine cublasAlloc() or, for array arguments residing on the CPU, by use of the Fortran 2003 intrinsic function c_loc(…), which returns a C pointer to a Fortran object. The helper function cublasError() is called from each of the double precision IMSL/NVIDIA BLAS codes to assess the availability of double precision floating point hardware on the GPU.
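As an illustration of the Table 9.8 helpers, the following sketch allocates a GPU working block, copies a CPU vector to it, and copies it back, checking one status value against CUBLAS_STATUS_SUCCESS from Table 9.6. The array X and the size N are local examples; note the TARGET attribute on X so that c_loc() may be applied to it.
! Sketch: move a vector to the GPU and back with the Table 9.8 helpers.
! FSIZE and CUBLAS_STATUS_SUCCESS are public symbols of CUDABLAS_LIBRARY.
USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
INTEGER, PARAMETER :: N = 1000
REAL, TARGET :: X(N)
TYPE(C_PTR) :: devX
INTEGER ISW
X = 1.E0
ISW = cublasInit()
! Allocate a GPU block for N single precision values (FSIZE bytes each).
ISW = cublasAlloc(N, FSIZE, devX)
IF (ISW /= CUBLAS_STATUS_SUCCESS) STOP 'cublasAlloc failed'
ISW = cublasSetVector(N, FSIZE, c_loc(X), 1, devX, 1)   ! CPU -> GPU
ISW = cublasGetVector(N, FSIZE, devX, 1, c_loc(X), 1)   ! GPU -> CPU
ISW = cublasFree(devX)
ISW = cublasShutdown()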
The NVIDIA Environmental Subprograms listed in Table 9.9 provide details about the runtime working environment.
Table 9.9 — NVIDIA Environmental Subprograms
Fortran Usage Name in Module                                     NVIDIA External Name
ISW = cudaGetDeviceCount(ICOUNT)                                 cudaGetDeviceCount()
ISW = cudaSetDevice(IDEVICE)  (IDEVICE is 0-indexed)             cudaSetDevice()
ISW = cudaGetDeviceProperties(<TYPE> cudaDeviceProp, IDEVICE)    cudaGetDeviceProperties()
One argument of cudaGetDeviceProperties is a Fortran derived type, cudaDeviceProp, with a C binding. It contains technical information about the device, including its name. The C character string NAME(*) is terminated with C_NULL_CHAR. The derived type cudaDeviceProp is described below:
TYPE, BIND(C) :: cudaDeviceProp
   CHARACTER(C_CHAR) NAME(256)               ! device name, C_NULL_CHAR terminated
   INTEGER(C_SIZE_T) totalGlobalMem          ! total global memory, in bytes
   INTEGER(C_SIZE_T) sharedMemPerBlock       ! shared memory per block, in bytes
   INTEGER(C_INT) regsPerBlock               ! 32-bit registers per block
   INTEGER(C_INT) warpSize                   ! threads per warp
   INTEGER(C_SIZE_T) memPitch                ! maximum pitch for memory copies, in bytes
   INTEGER(C_INT) maxThreadsPerBlock         ! maximum threads per block
   INTEGER(C_INT) maxThreadsDim(3)           ! maximum block dimensions
   INTEGER(C_INT) maxGridSize(3)             ! maximum grid dimensions
   INTEGER(C_SIZE_T) totalConstMem           ! constant memory, in bytes
   INTEGER(C_INT) major                      ! compute capability, major number
   INTEGER(C_INT) minor                      ! compute capability, minor number
   INTEGER(C_INT) clockRate                  ! clock rate, in kHz
   INTEGER(C_SIZE_T) textureAlignment        ! alignment requirement for textures
   INTEGER(C_INT) deviceOverlap              ! 1 if copy and execution can overlap
   INTEGER(C_INT) multiProcessorCount        ! number of multiprocessors
   INTEGER(C_INT) kernelExecTimeoutEnabled   ! 1 if kernels have a run-time limit
   INTEGER(C_INT) integrated                 ! 1 if an integrated (motherboard) GPU
   INTEGER(C_INT) canMapHostMemory           ! 1 if host memory can be mapped
   INTEGER(C_INT) computeMode                ! device compute mode
   INTEGER(C_INT) concurrentKernels          ! 1 if concurrent kernels are supported
END TYPE
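A short sketch of how the environmental subprograms might be used to list the available devices follows, assuming they and the cudaDeviceProp type are accessed from the CUDABLAS_LIBRARY module; only the loop bounds and the trimming of NAME(*) at C_NULL_CHAR are added beyond the Table 9.9 calls.
! Sketch: enumerate CUDA devices, printing each name and compute capability.
USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
TYPE(cudaDeviceProp) :: PROP
CHARACTER(LEN=256) :: DEVNAME
INTEGER ISW, ICOUNT, IDEVICE, K
ISW = cudaGetDeviceCount(ICOUNT)
DO IDEVICE = 0, ICOUNT-1                 ! devices are 0-indexed
   ISW = cudaGetDeviceProperties(PROP, IDEVICE)
   DEVNAME = ' '
   DO K = 1, 256                         ! copy NAME(*) up to C_NULL_CHAR
      IF (PROP%NAME(K) == C_NULL_CHAR) EXIT
      DEVNAME(K:K) = PROP%NAME(K)
   END DO
   PRINT *, 'Device ', IDEVICE, ': ', TRIM(DEVNAME), &
            ', compute capability ', PROP%major, '.', PROP%minor
END DO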
Required NVIDIA Copyright Notice:
© 2005–2014 by NVIDIA Corporation. All rights reserved.
Portions of the NVIDIA SGEMM and DGEMM library routines were written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows:
Copyright (©) 2007-09, Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (See the CUDA CUBLAS Library, Version 3.1, July 2010, for the remaining conditions.)